cs.AI updates on arXiv.org 10月02日
MathConstruct:针对构造性证明的新数学基准
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍了MathConstruct,一个包含121个来自各类数学竞赛的难题的新基准,旨在解决现有数学基准的局限性。该基准专注于构造性证明,适用于评估大型语言模型在数学领域的性能。

arXiv:2502.10197v2 Announce Type: replace Abstract: While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem simplicity or the viability of guessing or memorization. Crucially, they capture only a narrow subset of relevant math problems. To address this research gap, we introduce MathConstruct, a new benchmark of 121 challenging problems sourced from various math competitions, which targets constructive proofs, a widely encountered problem type requiring the construction of mathematical objects with specific properties. These proofs are particularly suitable for LLM evaluation, as solution correctness can be easily verified. Our automated verifiers also enable MathConstruct to generate problem variations, used to evaluate robustness. State-of-the-art LLMs solve only 60% of MathConstruct problems, highlighting its complexity and importance for LLM evaluation.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

MathConstruct 构造性证明 大型语言模型 数学基准 性能评估
相关文章