Articles related to "Evaluation Methods"
SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
cs.AI updates on arXiv.org
2025-11-03T05:18:46.000000Z
WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
cs.AI updates on arXiv.org
2025-10-31T04:05:58.000000Z
Reliable Evaluation and Benchmarks for Statement Autoformalization
cs.AI updates on arXiv.org
2025-10-30T04:23:15.000000Z
LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
cs.AI updates on arXiv.org
2025-10-29T04:27:54.000000Z
Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
cs.AI updates on arXiv.org
2025-10-29T04:17:20.000000Z
AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents
cs.AI updates on arXiv.org
2025-10-27T06:22:56.000000Z
Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities
cs.AI updates on arXiv.org
2025-10-24T04:20:50.000000Z
A Coherence-Based Measure of AGI
cs.AI updates on arXiv.org
2025-10-24T04:19:18.000000Z
Ling-1T Leads Non-Reasoning Performance, MCP Poses Security Risks, California Regulates AI, Auto-Tune for Agentic Prompts
Deeplearning
2025-10-23T17:31:06.000000Z
That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation
cs.AI updates on arXiv.org
2025-10-23T04:15:42.000000Z
Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents
cs.AI updates on arXiv.org
2025-10-23T04:11:32.000000Z
PICABench: How Far Are We from Physically Realistic Image Editing?
cs.AI updates on arXiv.org
2025-10-21T04:28:44.000000Z
Teaching AI to Ask the Hard Questions: How We Teach Machines to Judge Generated Videos | ICCV 2025
AI科技评论
2025-10-17T11:58:31.000000Z
Time Series Foundation Models: Benchmarking Challenges and Requirements
cs.AI updates on arXiv.org
2025-10-16T04:28:55.000000Z
LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
cs.AI updates on arXiv.org
2025-10-16T04:27:37.000000Z
Hey, wait a minute: on at-issue sensitivity in Language Models
cs.AI updates on arXiv.org
2025-10-15T05:10:44.000000Z