Evaluation Methods_Fishai

Trending

Articles related to "Evaluation Methods"

SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

cs.AI updates on arXiv.org 2025-11-03T05:18:46.000000Z

WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios

cs.AI updates on arXiv.org 2025-10-31T04:05:58.000000Z

Reliable Evaluation and Benchmarks for Statement Autoformalization

cs.AI updates on arXiv.org 2025-10-30T04:23:15.000000Z

LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

cs.AI updates on arXiv.org 2025-10-29T04:27:54.000000Z

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

cs.AI updates on arXiv.org 2025-10-29T04:17:20.000000Z

AgentArcEval: An Architecture Evaluation Method for Foundation Model based Agents

cs.AI updates on arXiv.org 2025-10-27T06:22:56.000000Z

Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities

cs.AI updates on arXiv.org 2025-10-24T04:20:50.000000Z

A Coherence-Based Measure of AGI

cs.AI updates on arXiv.org 2025-10-24T04:19:18.000000Z

Ling-1T Leads Non-Reasoning Performance, MCP Poses Security Risks, California Regulates AI, Auto-Tune for Agentic Prompts

Deeplearning 2025-10-23T17:31:06.000000Z

That's Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation

cs.AI updates on arXiv.org 2025-10-23T04:15:42.000000Z

Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents

cs.AI updates on arXiv.org 2025-10-23T04:11:32.000000Z

PICABench: How Far Are We from Physically Realistic Image Editing?

cs.AI updates on arXiv.org 2025-10-21T04:28:44.000000Z

Teaching AI to Ask the "Soul-Searching Questions": How We Teach Machines to Judge Generated Videos | ICCV 2025

AI科技评论 (AI Technology Review) 2025-10-17T11:58:31.000000Z

Time Series Foundation Models: Benchmarking Challenges and Requirements

cs.AI updates on arXiv.org 2025-10-16T04:28:55.000000Z

LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

cs.AI updates on arXiv.org 2025-10-16T04:27:37.000000Z

Hey, wait a minute: on at-issue sensitivity in Language Models

cs.AI updates on arXiv.org 2025-10-15T05:10:44.000000Z

Copyright © 2019 FISHAI. All Rights Reserved