LongWeave: Long-Form Generation Evaluation That Balances Real-World Relevance and Verifiability

cs.AI updates on arXiv.org, October 29, 12:27


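This paper presents LongWeave, a long-form generation evaluation method that balances real-world relevance with verifiability. It defines verifiable targets within real-world scenarios and then generates the corresponding queries, textual materials, and constraints, so that tasks are both realistic and objectively assessable. It supports multiple tasks and customizable input and output lengths.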

arXiv:2510.24345v1 Announce Type: cross Abstract: Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
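The target-first construction described in the abstract can be illustrated with a toy sketch. This is not the LongWeave or CoV-Eval implementation: the names `Constraint`, `build_task`, and `score`, the substring-based checks, and the word-count limit are all hypothetical choices made for illustration. The idea shown is only that the verifiable targets are fixed first, and the query and its constraints are then derived from them so the output can be checked mechanically.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Constraint:
    """One verifiable requirement on the generated text (hypothetical type)."""
    description: str
    check: Callable[[str], bool]  # True when the output satisfies the requirement


def build_task(targets: List[str], max_words: int) -> Tuple[str, List[Constraint]]:
    """Derive a query and its constraints from pre-defined targets.

    Assumption for this sketch: a target counts as covered when its text
    appears in the output (case-insensitive substring match).
    """
    constraints = [
        Constraint(
            description=f"mentions '{t}'",
            check=lambda out, t=t: t.lower() in out.lower(),
        )
        for t in targets
    ]
    constraints.append(
        Constraint(
            description=f"at most {max_words} words",
            check=lambda out: len(out.split()) <= max_words,
        )
    )
    query = (
        "Write a report that covers: "
        + "; ".join(targets)
        + f". Use at most {max_words} words."
    )
    return query, constraints


def score(output: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints the output satisfies."""
    passed = sum(1 for c in constraints if c.check(output))
    return passed / len(constraints)


if __name__ == "__main__":
    query, constraints = build_task(
        ["revenue rose 12%", "two new offices"], max_words=20
    )
    print(query)
    print(score("Revenue rose 12% this year.", constraints))
```

Because every constraint is generated from a known target, the score needs no human or model judge. A real benchmark would need far richer targets and verifiers than substring matching.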


