Agent Evaluation_Fishai

Articles related to "Agent Evaluation"

Task Completion Agents are Not Ideal Collaborators

cs.AI updates on arXiv.org 2025-10-30T04:21:34.000000Z

What Is Your Agent's GPA? A Framework for Evaluating Agent Goal-Plan-Action Alignment

cs.AI updates on arXiv.org 2025-10-13T04:09:00.000000Z

AgentBench: Evaluating LLMs as Agents

cs.AI updates on arXiv.org 2025-10-07T04:18:17.000000Z

Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

cs.AI updates on arXiv.org 2025-10-02T04:18:01.000000Z

Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm

cs.AI updates on arXiv.org 2025-10-02T04:12:55.000000Z

2025.09.29 | Real-time long video that plays while you chat; quantile baselines keep reasoning entropy stable

HuggingFace Daily AI Paper Digest 2025-09-30T01:15:14.000000Z

Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation

cs.AI updates on arXiv.org 2025-08-08T04:17:29.000000Z

2025, Year One of AI Agents! A Stanford scientist covers all the key points in 8,000 characters

新智元 2025-01-25T17:07:25.000000Z

Four Cutting-Edge Methods for Evaluating AI Agents and Enhancing LLM Performance

MarkTechPost@AI 2024-11-28T10:34:49.000000Z

Yuandong Tian's team unveils Agent-as-a-Judge in new work! AI agents judge themselves, and costs plunge 97%

智源社区 2024-10-29T00:37:52.000000Z

Yuandong Tian's team unveils Agent-as-a-Judge in new work! AI agents judge themselves, and costs plunge 97%

新智元 2024-10-28T14:26:08.000000Z

The race heats up! Meta releases Agent-as-a-Judge, letting agents evaluate agents

机器之心 2024-10-18T10:11:37.000000Z

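Too many AI scientists, and one test shows which are reliable! Princeton's new benchmark CORE-Bench: the strongest model reaches only 21% accuracy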

智源社区 2024-09-26T09:53:20.000000Z

Copyright © 2019 FISHAI. All Rights Reserved