"Benchmark" related articles
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
cs.AI updates on arXiv.org
2025-11-03T05:20:14.000000Z
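An ImageNet Moment for High-Dimensional Time Series Forecasting! First High-Dimensional Time Series Forecasting Benchmark Released, Model Leads SOTA Across Multiple Datasets
智源社区 (BAAI Community)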
2025-10-29T16:23:53.000000Z
ChessQA: Evaluating Large Language Models for Chess Understanding
cs.AI updates on arXiv.org
2025-10-29T04:24:43.000000Z
DAG-Math: Graph-Guided Mathematical Reasoning in LLMs
cs.AI updates on arXiv.org
2025-10-24T04:16:01.000000Z
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
cs.AI updates on arXiv.org
2025-10-21T04:24:01.000000Z
OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models
cs.AI updates on arXiv.org
2025-10-21T04:22:57.000000Z
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
cs.AI updates on arXiv.org
2025-10-20T04:14:11.000000Z
CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models
cs.AI updates on arXiv.org
2025-10-16T04:24:52.000000Z
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
cs.AI updates on arXiv.org
2025-10-15T04:38:56.000000Z
TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
cs.AI updates on arXiv.org
2025-10-13T04:09:36.000000Z
Leveraging Whisper Embeddings for Audio-based Lyrics Matching
cs.AI updates on arXiv.org
2025-10-10T04:16:35.000000Z
Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships
cs.AI updates on arXiv.org
2025-10-09T04:13:30.000000Z
VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code
cs.AI updates on arXiv.org
2025-10-09T04:06:31.000000Z
BanglaLlama: LLaMA for Bangla Language
cs.AI updates on arXiv.org
2025-10-08T04:15:31.000000Z
Kaputt: A Large-Scale Dataset for Visual Defect Detection
cs.AI updates on arXiv.org
2025-10-08T04:14:41.000000Z
FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning
cs.AI updates on arXiv.org
2025-10-07T04:07:24.000000Z
Emergent evaluation hubs in a decentralizing large language model ecosystem
cs.AI updates on arXiv.org
2025-10-03T04:13:17.000000Z
DexBench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management
cs.AI updates on arXiv.org
2025-10-02T04:16:50.000000Z
Deconstructing Self-Bias in LLM-generated Translation Benchmarks
cs.AI updates on arXiv.org
2025-10-01T06:02:04.000000Z
A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability
cs.AI updates on arXiv.org
2025-09-30T04:08:37.000000Z