Trending
Articles related to "AI Capability Evaluation"
K2-Thinking Open-Sourced, Supports 300-Step Tool Calls (Plus: An Outstanding Prompt)
赛博禅心 2025-11-07T12:03:45.000000Z
The "Length" of "Horizons"
LessWrong 2025-10-14T16:36:04.000000Z
[GDPval] Models Could Automate the U.S. Economy by 2027
LessWrong 2025-09-30T11:57:51.000000Z
What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities
cs.AI updates on arXiv.org 2025-09-25T05:02:20.000000Z
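GPT-5 Plays Ruthlessly and Becomes a Legend in a Single Game of Werewolf! Seven LLMs Show Off Their Acting, Leaving Human Players Speechless
BAAI Community 2025-09-01T11:28:03.000000Z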
Evaluating Prediction in Acausal Mixed-Motive Settings
LessWrong 2025-09-01T01:18:11.000000Z
Do model evaluations fall prey to the Good(er) Regulator Theorem?
LessWrong 2025-08-19T16:19:32.000000Z
Anthropic Is Going All In On Ability Without Intelligence?
LessWrong 2025-08-07T06:02:37.000000Z
The Mirror Test: How We've Overcomplicated AI Self-Recognition
LessWrong 2025-07-24T09:18:02.000000Z
The Mirror Test: How We've Overcomplicated AI Self-Recognition
LessWrong 2025-07-23T00:47:04.000000Z
The Elicitation Game: Evaluating Capability Elicitation Techniques
cs.AI updates on arXiv.org 2025-07-22T04:44:37.000000Z
DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering
cs.AI updates on arXiv.org 2025-07-16T04:28:40.000000Z
Interpreting the METR Time Horizons Post
LessWrong 2025-04-30T03:12:28.000000Z
Recent AI model progress feels mostly like bullshit
LessWrong 2025-03-24T19:32:10.000000Z
The Elicitation Game: Evaluating capability elicitation techniques
LessWrong 2025-02-27T20:36:59.000000Z
These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models
TechCrunch News 2025-02-06T06:12:36.000000Z
Understanding Benchmarks and motivating Evaluations
LessWrong 2025-02-06T01:51:47.000000Z
“Humanity's Last Exam” Benchmark Released: Top AI Systems Perform Poorly, With None Exceeding 10% Answer Accuracy
ITHome 2025-01-24T08:37:28.000000Z