Trending
Articles related to "evaluation platform"
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
cs.AI updates on arXiv.org 2025-10-13T04:13:09.000000Z
BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
cs.AI updates on arXiv.org 2025-10-06T04:18:56.000000Z
World's first scientific-research LLM arena launches, 23 top models face off: o3 takes first place, DeepSeek ranks fourth
36kr 2025-07-11T08:29:13.000000Z
Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents
cs.AI updates on arXiv.org 2025-07-09T04:01:25.000000Z
Coval evaluates AI voice and chat agents like self-driving cars
TechCrunch News 2025-01-23T15:05:35.000000Z
GenAI-Arena: An Open Platform for Community-Based Evaluation of Generative AI Models
MarkTechPost@AI 2024-06-13T05:01:50.000000Z
Patronus AI Created a Groundbreaking Automated Evaluation Platform
AiThority 2024-05-30T06:32:17.000000Z