Trending
Articles related to "AI Evaluation"
How far are AI "world models" from real surgery? SurgVeo, the first surgical video generation benchmark, reveals a "plausibility gap"
我爱计算机视觉 2025-11-10T10:47:42.000000Z
Second worldwide, first in China! A first-hand test of ERNIE 5.0 Preview, the strongest on text
机器之心 2025-11-09T18:10:13.000000Z
Study finds AI capabilities may be overstated because of flawed tests
cnBeta全文版 2025-11-07T07:45:00.000000Z
Evaluation engineering is becoming the focus of the next round of Agent evolution
oschina.net 2025-11-07T03:08:18.000000Z
Oxford University study: current benchmarks widely overstate AI model performance
IT之家 2025-11-06T09:57:46.000000Z
Databricks research reveals that building better AI judges isn't just a technical concern, it's a people problem
VentureBeat 2025-11-04T20:12:23.000000Z
AGI gets a new "authoritative" definition! Turing Award winner Yoshua Bengio and others propose it; GPT-5 reaches only 57%
智源社区 2025-10-30T11:59:09.000000Z
AGI gets a new "authoritative" definition: Turing Award winner Yoshua Bengio and others propose it; GPT-5 reaches only 57%
36kr-科技 2025-10-29T10:18:40.000000Z
From "can draw" to "can think": Kuaishou's Kling team proposes T2I-CoReBench; even the strongest models cannot escape the reasoning bottleneck
我爱计算机视觉 2025-10-25T08:56:32.000000Z
Instruction-following deviation in LLMs
掘金 人工智能 2025-10-24T19:02:12.000000Z
Seedream 4.0 vs. Nano Banana and GPT-4o? EdiVal-Agent settles image editing evaluation
机器之心 2025-10-24T09:00:14.000000Z
Seedream 4.0 vs. Nano Banana and GPT-4o? EdiVal-Agent settles image editing evaluation
机器之心 2025-10-24T06:48:09.000000Z
Braintrust Java SDK: AI observability and evals for the JVM
Braintrust Blog 2025-10-24T05:16:48.000000Z
ICCV 2025 | Can AI follow a movie plot? VRBench launches the first "long-video reasoning exam"
PaperWeekly 2025-10-22T15:13:53.000000Z
ICCV 2025 | Can AI follow a movie plot? VRBench launches the first "long-video reasoning exam"
PaperWeekly 2025-10-22T14:32:56.000000Z
Instagram cofounder rips ‘AI FOMO’ that caused a rush to adopt and no metrics: ‘When it gets fuzzy, it’s very hard to then evaluate’
Fortune | FORTUNE 2025-10-21T17:20:48.000000Z
Asked to "watch a video and write a web page", GPT-5 scores only 36.35! Shanghai AI Lab jointly releases the first video2code benchmark
量子位 2025-10-20T12:34:13.000000Z