MarkTechPost@AI · September 16
Stanford Releases MedAgentBench, a Benchmark for Healthcare AI Agents

Stanford University researchers have introduced MedAgentBench, a new benchmark suite for evaluating large language model (LLM) agents in healthcare. Unlike earlier question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment in which AI systems must interact, plan, and execute multi-step clinical tasks, shifting the focus of evaluation from static reasoning to agentic capability in realistic, tool-based medical workflows. The benchmark comprises 300 tasks designed by practicing physicians, covering patient information retrieval, result tracking, documentation, order placement, and more, and draws on 100 realistic de-identified patient profiles, offering an important framework for assessing the real-world performance of healthcare AI agents.

🩺 **What makes MedAgentBench new**: the benchmark suite creates a simulated electronic health record (EHR) environment in which AI agents interact, plan, and execute multi-step clinical tasks. This is a significant departure from earlier static question-answering evaluations, emphasizing how well AI performs in real medical workflows.

📝 **Rich tasks and data**: MedAgentBench contains 300 tasks designed by practicing physicians across 10 categories, including patient information retrieval, lab result tracking, documentation, order placement, referrals, and medication management. It is built on 100 realistic patient profiles extracted from Stanford's STARR data repository, comprising more than 700,000 de-identified records that preserve clinical validity.

📊 **Model performance and challenges**: across 12 leading LLMs, Claude 3.5 Sonnet v2 performed best with a 69.67% success rate, followed closely by GPT-4o. Most models handled query tasks well but struggled with action tasks that require safe multi-step execution; the dominant error modes were instruction-adherence failures and output mismatches, underscoring the precision and reliability demanded in clinical settings.

🛠️ **FHIR-compliant environment and evaluation standard**: MedAgentBench builds a FHIR-compliant simulated environment that supports both retrieval and modification of EHR data, enabling realistic clinical interactions. Evaluation uses a strict success rate (SR) metric measured as pass@1, reflecting the high bar for safety in real-world deployment.

A team of Stanford University researchers has released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in healthcare contexts. Unlike prior question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment where AI systems must interact, plan, and execute multi-step clinical tasks. This marks a significant shift from testing static reasoning to assessing agentic capabilities in live, tool-based medical workflows.

https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

Why Do We Need Agentic Benchmarks in Healthcare?

Recent LLMs have moved beyond static chat-based interactions toward agentic behavior—interpreting high-level instructions, calling APIs, integrating patient data, and automating complex processes. In medicine, this evolution could help address staff shortages, documentation burden, and administrative inefficiencies.

While general-purpose agent benchmarks (e.g., AgentBench, AgentBoard, tau-bench) exist, healthcare lacked a standardized benchmark that captures the complexity of medical data, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible, clinically relevant evaluation framework.

What Does MedAgentBench Contain?

How Are the Tasks Structured?

MedAgentBench consists of 300 tasks across 10 categories, written by licensed physicians. These tasks include patient information retrieval, lab result tracking, documentation, test ordering, referrals, and medication management. Tasks average 2–3 steps and mirror workflows encountered in inpatient and outpatient care.
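
As a rough illustration, a task record in this style pairs a physician-authored instruction with the patient context and the ground truth needed for automated grading. The sketch below is a minimal assumption of what such a record might look like; the actual MedAgentBench schema is not specified in this article, and all field names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ClinicalTask:
    """Hypothetical shape of one benchmark task (field names are illustrative)."""
    task_id: str           # e.g. "task_042"
    category: str          # one of the 10 categories, e.g. "lab_result_tracking"
    instruction: str       # physician-authored natural-language instruction
    patient_id: str        # links the task to one of the 100 patient profiles
    expected_steps: int    # tasks average 2-3 steps
    reference_answer: str  # ground truth used for pass@1 grading

task = ClinicalTask(
    task_id="task_042",
    category="lab_result_tracking",
    instruction="Retrieve the most recent HbA1c value for this patient.",
    patient_id="patient_017",
    expected_steps=2,
    reference_answer="7.2",
)
```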

What Patient Data Supports the Benchmark?

The benchmark leverages 100 realistic patient profiles extracted from Stanford’s STARR data repository, comprising over 700,000 records including labs, vitals, diagnoses, procedures, and medication orders. Data was de-identified and jittered for privacy while preserving clinical validity.
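
Date jittering is a common de-identification technique: every timestamp for a given patient is shifted by the same random offset, so the intervals between events (and thus clinical validity) are preserved while the real dates are hidden. The sketch below is a generic illustration of the idea, not the actual STARR pipeline.

```python
import random
from datetime import datetime, timedelta

def jitter_dates(records, patient_id, max_days=180):
    """Shift every timestamp for one patient by the same random offset,
    preserving the intervals between events (generic illustration only)."""
    offset = timedelta(days=random.Random(patient_id).randint(-max_days, max_days))
    return [{**r, "timestamp": r["timestamp"] + offset} for r in records]

records = [
    {"code": "HbA1c", "value": 7.2, "timestamp": datetime(2021, 3, 1)},
    {"code": "HbA1c", "value": 6.9, "timestamp": datetime(2021, 9, 1)},
]
jittered = jitter_dates(records, "patient_017")
# The six-month gap between the two labs is unchanged after jittering.
```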

How Is the Environment Built?

The environment is FHIR-compliant, supporting both retrieval (GET) and modification (POST) of EHR data. AI systems can simulate realistic clinical interactions such as documenting vitals or placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
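
In concrete terms, a FHIR-compliant environment exposes standard REST endpoints. The sketch below shows what an agent's tool calls might look like against a generic FHIR server; the base URL and resource contents are assumptions for illustration, not MedAgentBench's actual configuration.

```python
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # assumed local FHIR server

# GET: retrieve recent blood-glucose observations (LOINC 2339-0) for a patient
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "patient_017", "code": "2339-0", "_sort": "-date", "_count": 5},
)
observations = resp.json().get("entry", [])

# POST: place a medication order as a FHIR MedicationRequest resource
order = {
    "resourceType": "MedicationRequest",
    "status": "active",
    "intent": "order",
    "subject": {"reference": "Patient/patient_017"},
    "medicationCodeableConcept": {"text": "metformin 500 mg oral tablet"},
}
resp = requests.post(f"{FHIR_BASE}/MedicationRequest", json=order)
print(resp.status_code)
```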

How Are Models Evaluated?
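
Evaluation uses a strict success rate (SR) metric measured as pass@1: each agent gets exactly one attempt per task, so SR = (tasks solved on the first attempt) / 300. This reflects the real-world demand for safety in clinical deployment, where an incorrect action cannot simply be retried.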

Which Models Performed Best?
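
Across the 12 state-of-the-art LLMs evaluated, Claude 3.5 Sonnet v2 achieved the best overall success rate at 69.67%, with GPT-4o close behind. Most models handled read-only query tasks well but were considerably less reliable on action tasks that require safe multi-step execution and modification of the record.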


What Errors Did Models Make?

Two dominant failure patterns emerged:

1. Instruction adherence failures: invalid API calls or incorrect JSON formatting.
2. Output mismatch: providing full sentences when structured numerical values were required.

These errors highlight gaps in precision and reliability, both critical in clinical deployment.
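
A strict pass@1 grader can reject both failure modes mechanically. Below is a minimal sketch of such a check, assuming a task that expects a JSON object with a numeric field; the schema and function name are hypothetical, not the benchmark's actual grading code.

```python
import json

def grade_response(raw: str, expected_field: str = "value") -> bool:
    """Return True only if the agent emitted valid JSON containing a
    numeric field -- prose answers and malformed JSON both fail."""
    try:
        payload = json.loads(raw)  # instruction-adherence check
    except json.JSONDecodeError:
        return False               # invalid JSON: fail
    return isinstance(payload, dict) and isinstance(
        payload.get(expected_field), (int, float)
    )                              # output-mismatch check

grade_response('{"value": 7.2}')                  # True
grade_response("The patient's HbA1c is 7.2%.")    # False: prose, not JSON
grade_response('{"value": "seven point two"}')    # False: wrong type
```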

Summary

MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in realistic EHR settings, pairing 300 clinician-authored tasks with a FHIR-compliant environment and 100 patient profiles. Results show strong potential but limited reliability (Claude 3.5 Sonnet v2 leads at 69.67%), highlighting the gap between query success and safe action execution. While constrained by single-institution data and an EHR-focused scope, MedAgentBench provides an open, reproducible framework to drive the next generation of dependable healthcare AI agents.



