Trending
Articles related to "Model Evaluation"
A/B testing could lead LLMs to retain users instead of helping them
LessWrong 2025-11-04T19:47:28.000000Z
Flawed AI benchmarks put enterprise budgets at risk
AI News 2025-11-04T15:25:46.000000Z
Measuring what matters: How offline evaluation of GitHub MCP Server works
The GitHub Blog 2025-10-30T22:00:37.000000Z
How Do We Evaluate the Quality of LLMs' Mathematical Responses?
LessWrong 2025-10-29T09:14:28.000000Z
Tackling AI's inconsistent sensitivity to different context positions: a new framework lets "whoever tied the bell untie it"
智源社区 2025-10-27T17:39:43.000000Z
Seedream 4.0 vs. Nano Banana and GPT-4o? EdiVal-Agent aims to settle image-editing evaluation
机器之心 2025-10-24T10:43:24.000000Z
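R-HORIZON: the era of long-horizon reasoning arrives as Fudan NLP and Meituan LongCat release a new paradigm for probing the capability boundaries of LRMs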
机器之心 2025-10-23T06:16:01.000000Z
A Brief Review of DeepSeek-OCR
孔某人的低维认知 2025-10-22T15:00:19.000000Z
2025.10.20 | RPC pruning speeds up inference while preserving accuracy; OmniVinci leads cross-modal performance with less data
HuggingFace 每日AI论文速递 2025-10-21T08:18:56.000000Z
From drawing to thinking: Kuaishou Kling proposes T2I-CoReBench, and even the strongest models hit a reasoning bottleneck
PaperWeekly 2025-10-19T08:34:29.000000Z
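喝点VC | YC in conversation with Anthropic's head of pretraining: pretraining teams must also consider inference, and how to balance pretraining and post-training is still at an early, exploratory stage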
Z Potentials 2025-10-16T09:58:40.000000Z
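Huazhong University of Science and Technology and others release OCRBench v2; Gemini tops the Chinese leaderboard but with only a passing score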
36kr-科技 2025-10-14T07:29:44.000000Z
Assuring Agent Safety Evaluations By Analysing Transcripts
LessWrong 2025-10-10T10:02:54.000000Z
Anthropic's latest AI model can tell when it's being evaluated: 'I think you're testing me'
All Content from Business Insider 2025-10-07T08:38:21.000000Z
No Title
OpenAI Cookbook 2025-10-06T17:55:27.000000Z
Claude Sonnet 4.5 reportedly asked testers to come clean: "I think you're testing me"
IT之家 2025-10-03T01:16:29.000000Z
How to Build an End-to-End Data Science Workflow with Machine Learning, Interpretability, and Gemini AI Assistance?
MarkTechPost@AI 2025-09-25T07:11:35.000000Z
Express | Used by both Claude and OpenAI: Sequoia leads investment in AI code review as Irregular raises $80 million at a $450 million valuation
Z Potentials 2025-09-18T05:50:34.000000Z
Plot twist! Right after LeCun reposted the "world's fastest open-source reasoning model," ETH Zurich debunked the claim
智源社区 2025-09-17T04:13:58.000000Z