AI Capability Evaluation_Fishai

Articles related to "AI Capability Evaluation"

K2-Thinking open-sourced, supports 300-step tool calls (plus: an outstanding prompt)

赛博禅心 2025-11-07T12:03:45.000000Z

The "Length" of "Horizons"

LessWrong 2025-10-14T16:36:04.000000Z

[GDPval] Models Could Automate the U.S. Economy by 2027

LessWrong 2025-09-30T11:57:51.000000Z

What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities

cs.AI updates on arXiv.org 2025-09-25T05:02:20.000000Z

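GPT-5 plays a cold, calculating game and dominates at Werewolf! Seven major LLMs show off their acting, leaving human players speechless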

BAAI Community 2025-09-01T11:28:03.000000Z

Evaluating Prediction in Acausal Mixed-Motive Settings

LessWrong 2025-09-01T01:18:11.000000Z

Do model evaluations fall prey to the Good(er) Regulator Theorem?

LessWrong 2025-08-19T16:19:32.000000Z

Anthropic Is Going All In On Ability Without Intelligence?

LessWrong 2025-08-07T06:02:37.000000Z

The Mirror Test: How We've Overcomplicated AI Self-Recognition

LessWrong 2025-07-24T09:18:02.000000Z

The Mirror Test: How We've Overcomplicated AI Self-Recognition

LessWrong 2025-07-23T00:47:04.000000Z

The Elicitation Game: Evaluating Capability Elicitation Techniques

cs.AI updates on arXiv.org 2025-07-22T04:44:37.000000Z

DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

cs.AI updates on arXiv.org 2025-07-16T04:28:40.000000Z

Interpreting the METR Time Horizons Post

LessWrong 2025-04-30T03:12:28.000000Z

Recent AI model progress feels mostly like bullshit

LessWrong 2025-03-24T19:32:10.000000Z

The Elicitation Game: Evaluating capability elicitation techniques

LessWrong 2025-02-27T20:36:59.000000Z

These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

TechCrunch News 2025-02-06T06:12:36.000000Z

Understanding Benchmarks and motivating Evaluations

LessWrong 2025-02-06T01:51:47.000000Z

"Humanity's Last Exam" benchmark released: top AI systems perform poorly, with none exceeding 10% answer accuracy

ITHome 2025-01-24T08:37:28.000000Z
