LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

cs.AI updates on arXiv.org, August 22

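This paper proposes LongRecall, a framework for evaluating the completeness of machine-generated text. It works in three stages: it decomposes answers into facts, filters candidate matches lexically and semantically, and verifies their alignment through structured reasoning, which improves the accuracy of recall evaluation for machine-generated text.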

arXiv:2508.15085v1 Announce Type: cross

Abstract: The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
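The three stages described in the abstract (fact decomposition, lexical and semantic filtering, entailment verification) can be pictured with a minimal Python sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so every function name, threshold, and scoring rule below is a hypothetical stand-in. Sentence splitting stands in for fact decomposition, Jaccard token overlap for the lexical filter, `difflib.SequenceMatcher` for a real embedding similarity, and token containment for an LLM or NLI entailment check. The sketch also assumes that recall is computed as the share of reference facts supported by the generated answer.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List, Set


def split_facts(text: str) -> List[str]:
    """Stage 1 stand-in: treat each sentence as one self-contained fact."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_score(a: str, b: str) -> float:
    """Stage 2a stand-in: Jaccard overlap of word tokens."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def semantic_score(a: str, b: str) -> float:
    """Stage 2b stand-in: character-level similarity in place of embeddings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def entails(premise: str, hypothesis: str) -> bool:
    """Stage 3 stand-in: every hypothesis token must appear in the premise.
    A real system would call an NLI model or an LLM judge here."""
    return tokens(hypothesis) <= tokens(premise)


def recall(
    reference: str,
    generated: str,
    lex_min: float = 0.2,   # hypothetical threshold
    sem_min: float = 0.5,   # hypothetical threshold
    entail_fn: Callable[[str, str], bool] = entails,
) -> float:
    """Fraction of reference facts supported by the generated text."""
    ref_facts = split_facts(reference)
    gen_facts = split_facts(generated)
    if not ref_facts:
        return 0.0
    covered = 0
    for fact in ref_facts:
        # Narrow the candidates step by step before the costly check.
        candidates = [g for g in gen_facts if lexical_score(fact, g) >= lex_min]
        candidates = [g for g in candidates if semantic_score(fact, g) >= sem_min]
        if any(entail_fn(g, fact) for g in candidates):
            covered += 1
    return covered / len(ref_facts)
```

With a three-sentence reference of which the generated answer reproduces two sentences verbatim, `recall` returns 2/3. The stand-in entailment check cannot recognize paraphrases, which is exactly the weakness of lexical metrics that the abstract points out, so in practice `entail_fn` would be swapped for a model-based check.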
