Articles related to "Agent Evaluation"
Task Completion Agents are Not Ideal Collaborators
cs.AI updates on arXiv.org
2025-10-30T04:21:34.000000Z
What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment
cs.AI updates on arXiv.org
2025-10-13T04:09:00.000000Z
AgentBench: Evaluating LLMs as Agents
cs.AI updates on arXiv.org
2025-10-07T04:18:17.000000Z
Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
cs.AI updates on arXiv.org
2025-10-02T04:18:01.000000Z
Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm
cs.AI updates on arXiv.org
2025-10-02T04:12:55.000000Z
2025.09.29 | Real-Time Long Video That Plays While You Chat; Quantile Baselines Keep Reasoning Entropy Stable
HuggingFace Daily AI Paper Digest
2025-09-30T01:15:14.000000Z
Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation
cs.AI updates on arXiv.org
2025-08-08T04:17:29.000000Z
2025, the Year of the Agent! A Stanford Scientist Covers All the Key Points in 8,000 Characters
新智元
2025-01-25T17:07:25.000000Z
Four Cutting-Edge Methods for Evaluating AI Agents and Enhancing LLM Performance
MarkTechPost@AI
2024-11-28T10:34:49.000000Z
Yuandong Tian's Team Unveils Agent-as-a-Judge in New Work! AI Agents Judge Themselves, Costs Plunge 97%
智源社区
2024-10-29T00:37:52.000000Z
Yuandong Tian's Team Unveils Agent-as-a-Judge in New Work! AI Agents Judge Themselves, Costs Plunge 97%
新智元
2024-10-28T14:26:08.000000Z
The Race Heats Up! Meta Releases Agent-as-a-Judge, Letting Agents Evaluate Agents
机器之心
2024-10-18T10:11:37.000000Z
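Too Many AI Scientists, and One Test Shows Which Are Reliable! Princeton's New Benchmark CORE-Bench: The Strongest Model Reaches Only 21% Accuracy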
智源社区
2024-09-26T09:53:20.000000Z