Building and Evaluating MCQA Benchmarks from Scientific Literature

cs.AI updates on arXiv.org, September 16

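This paper proposes a modular framework for generating multiple-choice question-answering (MCQA) benchmarks from large corpora of scientific papers, automating PDF parsing, semantic chunk extraction, question generation, and model evaluation. In a case study, it generates more than 16,000 MCQs from 22,000 open-access papers and evaluates a suite of small language models on the resulting benchmark.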

arXiv:2509.10744v1 Announce Type: cross

Abstract: As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.
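The comparison the abstract describes, baseline accuracy versus accuracy with retrieved context, can be illustrated with a small sketch. This is not the authors' implementation: every name here (`MCQ`, `chunk_text`, `retrieve`, `build_prompt`, `accuracy`, `answer_fn`) is hypothetical, sentence packing stands in for semantic chunking, and bag-of-words cosine similarity stands in for whatever retriever the paper uses. The `answer_fn` argument is a placeholder for a call to a language model that returns a single answer letter.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional

LETTERS = "ABCDEFGH"


@dataclass
class MCQ:
    """One multiple-choice question (hypothetical schema, not from the paper)."""
    question: str
    options: List[str]
    answer_index: int


def chunk_text(text: str, max_words: int = 40) -> List[str]:
    """Pack whole sentences into chunks of at most max_words words.

    A simple stand-in for semantic chunking; a single sentence longer than
    max_words still becomes its own chunk.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def _bag(text: str) -> Counter:
    """Lowercased bag of alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = _bag(query)
    return sorted(chunks, key=lambda c: cosine(q, _bag(c)), reverse=True)[:k]


def build_prompt(mcq: MCQ, context: Optional[List[str]] = None) -> str:
    """Format the question, prepending retrieved context when provided."""
    lines: List[str] = []
    if context:
        lines.append("Context:")
        lines.extend(context)
        lines.append("")
    lines.append(f"Question: {mcq.question}")
    lines.extend(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(mcq.options))
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def accuracy(mcqs: List[MCQ], answer_fn: Callable[[str], str],
             chunks: Optional[List[str]] = None) -> float:
    """Fraction answered correctly; pass chunks to enable retrieval."""
    correct = 0
    for mcq in mcqs:
        context = retrieve(mcq.question, chunks) if chunks else None
        prediction = answer_fn(build_prompt(mcq, context))
        correct += prediction == LETTERS[mcq.answer_index]
    return correct / len(mcqs) if mcqs else 0.0
```

Calling `accuracy(mcqs, answer_fn)` gives the baseline score, and `accuracy(mcqs, answer_fn, chunks)` gives the retrieval-augmented score for the same questions. Under the same assumptions, passing a list of reasoning traces instead of paper chunks would mimic the reasoning-trace retrieval setting.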


