A Study of Answer Consistency in Small LLMs

cs.AI updates on arXiv.org, September 15

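This paper studies how consistently small LLMs answer repeated questions, compares performance across different models and parameter settings, and proposes new analysis tools.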

arXiv:2509.09705v1 Announce Type: cross Abstract: This work explores the consistency of small LLMs (2B-8B parameters) in answering the same question multiple times. We present a study on known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium models (50B-80B), finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both of them. To support those studies, we propose some new analytical and graphical tools. Results show that the number of questions that can be answered consistently varies considerably among models but is typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers seems to correlate reasonably well with overall accuracy. Results for medium-sized models seem to indicate much higher levels of answer consistency.
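To make the quantities in the abstract concrete, here is a minimal Python sketch of how answer consistency over repeated trials might be measured. The abstract does not give the paper's exact definitions, so everything here is an assumption made for illustration: the function name `consistency_report`, the `min_agreement` threshold, and the rule that a question counts as "consistently answered" when the most frequent option reaches that share of the repetitions (by default, all of them).

```python
from collections import Counter


def consistency_report(answers, gold, min_agreement=1.0):
    """Summarize consistency and accuracy over repeated multiple-choice trials.

    answers: dict mapping question id -> list of options chosen in each repetition
    gold:    dict mapping question id -> correct option
    min_agreement: assumed threshold; share of repetitions the most frequent
                   option must reach for the question to count as consistent
    """
    consistent = {}      # question id -> the consistently chosen option
    correct_trials = 0   # individual repetitions that matched the gold answer
    total_trials = 0

    for qid, reps in answers.items():
        total_trials += len(reps)
        correct_trials += sum(1 for a in reps if a == gold[qid])

        # Most frequent option and how often it was chosen.
        top_option, count = Counter(reps).most_common(1)[0]
        if count / len(reps) >= min_agreement:
            consistent[qid] = top_option

    n_consistent = len(consistent)
    n_consistent_correct = sum(
        1 for qid, option in consistent.items() if option == gold[qid]
    )

    return {
        # Share of questions answered consistently across repetitions.
        "consistency_rate": n_consistent / len(answers),
        # Share of all individual repetitions that were correct.
        "overall_accuracy": correct_trials / total_trials,
        # Accuracy restricted to consistently answered questions.
        "accuracy_among_consistent": (
            n_consistent_correct / n_consistent if n_consistent else None
        ),
    }


if __name__ == "__main__":
    # Toy data: 4 questions, 10 repetitions each (made up for illustration).
    toy_answers = {
        "q1": ["A"] * 10,
        "q2": ["B"] * 9 + ["C"],
        "q3": ["D"] * 10,
        "q4": ["A", "B"] * 5,
    }
    toy_gold = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}
    print(consistency_report(toy_answers, toy_gold))
```

Note that a consistently answered question is not necessarily answered correctly (`q3` in the toy data is consistent but wrong), which is why the sketch reports consistency and accuracy among consistent answers separately.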
