Articles related to "Evaluation Suite"
POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios
cs.AI updates on arXiv.org
2025-11-10T05:10:42.000000Z
DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay
cs.AI updates on arXiv.org
2025-10-14T04:08:33.000000Z
Evaluating Embedding Frameworks for Scientific Domain
cs.AI updates on arXiv.org
2025-10-09T04:05:38.000000Z
CharXiv: A Comprehensive Evaluation Suite Advancing Multimodal Large Language Models Through Realistic Chart Understanding Benchmarks
MarkTechPost@AI
2024-06-29T04:01:35.000000Z