The Fragility of Frontier Models on Medical Benchmarks

cs.AI updates on arXiv.org, October 2

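The article argues that the top scores large frontier models such as GPT-5 earn on medical benchmarks conceal real weaknesses: the scores do not reflect real-world capability, and the authors call for more rigorous evaluation of AI systems.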

arXiv:2509.18234v2 Announce Type: replace Abstract: Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren't glitches; they expose how today's benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.
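
The abstract names two stress tests (removing key inputs such as images, and making trivial prompt changes) without describing the protocol. The sketch below shows one way such checks could be set up. It is not the authors' method. The `Item` structure, the `Model` callable signature, the option-reversal perturbation, and the toy `first_option_model` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class Item:
    """One multiple-choice benchmark question (hypothetical structure)."""
    question: str
    options: List[str]
    answer: str            # gold answer, stored as the option text
    image: Optional[str]   # placeholder image reference; None if absent

# Assumed model interface: (question, options, image) -> chosen option text.
Model = Callable[[str, Sequence[str], Optional[str]], str]

def accuracy(model: Model, items: Sequence[Item], drop_image: bool = False) -> float:
    """Fraction answered correctly, optionally with the image withheld."""
    correct = 0
    for it in items:
        image = None if drop_image else it.image
        if model(it.question, it.options, image) == it.answer:
            correct += 1
    return correct / len(items)

def flip_rate(model: Model, items: Sequence[Item],
              perturb: Callable[[List[str]], List[str]]) -> float:
    """Fraction of items whose answer changes after a trivial option perturbation."""
    flips = 0
    for it in items:
        base = model(it.question, it.options, it.image)
        alt = model(it.question, perturb(it.options), it.image)
        if base != alt:
            flips += 1
    return flips / len(items)

def reverse_options(options: List[str]) -> List[str]:
    """A content-preserving perturbation: same options, reversed order."""
    return list(reversed(options))

def first_option_model(question: str, options: Sequence[str],
                       image: Optional[str]) -> str:
    """Toy 'shortcut' model: ignores the question and image, picks option one."""
    return options[0]

# Tiny made-up dataset in which the gold answer happens to be listed first.
items = [
    Item("Q1", ["A", "B", "C"], "A", "img1.png"),
    Item("Q2", ["X", "Y"], "X", "img2.png"),
]

acc_with_image = accuracy(first_option_model, items)
acc_without_image = accuracy(first_option_model, items, drop_image=True)
flips = flip_rate(first_option_model, items, reverse_options)
```

On this toy data the shortcut model scores perfectly with or without the image, yet changes every answer when the options are reordered. That is the kind of gap between headline accuracy and robustness the abstract describes. A real evaluation would need many items, several perturbation types, and clinician review of the reasoning, none of which this sketch covers.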
