Trending
Articles related to "AI evaluation"
Databricks research reveals that building better AI judges isn't just a technical concern, it's a people problem
VentureBeat 2025-11-04T20:12:23.000000Z
AGI gets a new "authoritative" definition! Proposed by Turing Award winner Yoshua Bengio and others; GPT-5 reaches only 57%
智源社区 2025-10-30T11:59:09.000000Z
AGI gets a new "authoritative" definition, proposed by Turing Award winner Yoshua Bengio and others; GPT-5 reaches only 57%
36kr-科技 2025-10-29T10:18:40.000000Z
From "can draw" to "can think": Kuaishou's Kling team proposes T2I-CoReBench; even the strongest models cannot escape the reasoning bottleneck
我爱计算机视觉 2025-10-25T08:56:32.000000Z
Instruction-following deviations in LLMs
掘金 人工智能 2025-10-24T19:02:12.000000Z
Seedream 4.0 vs. Nano Banana and GPT-4o? EdiVal-Agent settles image-editing evaluation
机器之心 2025-10-24T09:00:14.000000Z
Seedream 4.0 vs. Nano Banana and GPT-4o? EdiVal-Agent settles image-editing evaluation
机器之心 2025-10-24T06:48:09.000000Z
Braintrust Java SDK: AI observability and evals for the JVM
Braintrust Blog 2025-10-24T05:16:48.000000Z
ICCV 2025 | Can AI follow a movie plot? VRBench launches the first "long-video reasoning exam"
PaperWeekly 2025-10-22T15:13:53.000000Z
ICCV 2025 | Can AI follow a movie plot? VRBench launches the first "long-video reasoning exam"
PaperWeekly 2025-10-22T14:32:56.000000Z
Instagram cofounder rips ‘AI FOMO’ that caused a rush to adopt and no metrics: ‘When it gets fuzzy, it’s very hard to then evaluate’
Fortune | FORTUNE 2025-10-21T17:20:48.000000Z
Making models "watch a video and write a web page": GPT-5 scores only 36.35! Shanghai AI Lab jointly releases the first video2code benchmark
量子位 2025-10-20T12:34:13.000000Z
Bengio launches an AGI "gaokao": GPT-5 scores 0 in one category
新智元 2025-10-17T16:17:19.000000Z
Stop Measuring AI Like Software
Communications of the ACM - Artificial Intelligence 2025-10-17T14:49:19.000000Z
Under the new AGI definition from Bengio and other leading researchers, GPT-5 has achieved less than 10%
机器之心 2025-10-17T13:34:40.000000Z
Teaching AI to ask the hard questions: how we teach machines to judge generated videos | ICCV 2025
AI科技评论 2025-10-17T11:58:31.000000Z