Trending
Articles related to "evaluation suite"
POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios
cs.AI updates on arXiv.org 2025-11-10T05:10:42.000000Z
DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay
cs.AI updates on arXiv.org 2025-10-14T04:08:33.000000Z
Evaluating Embedding Frameworks for Scientific Domain
cs.AI updates on arXiv.org 2025-10-09T04:05:38.000000Z
CharXiv: A Comprehensive Evaluation Suite Advancing Multimodal Large Language Models Through Realistic Chart Understanding Benchmarks
MarkTechPost@AI 2024-06-29T04:01:35.000000Z