Articles related to "benchmark platform"
InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research
cs.AI updates on arXiv.org 2025-11-03T05:18:27.000000Z
PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature
cs.AI updates on arXiv.org 2025-10-14T04:09:48.000000Z
SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities
cs.AI updates on arXiv.org 2025-10-08T04:15:26.000000Z
TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods
cs.AI updates on arXiv.org 2025-08-19T04:21:11.000000Z
BALSAM: A Platform for Benchmarking Arabic Large Language Models
cs.AI updates on arXiv.org 2025-07-31T04:48:13.000000Z
PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors
cs.AI updates on arXiv.org 2025-07-22T04:34:13.000000Z
GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning
cs.AI updates on arXiv.org 2025-07-08T04:33:41.000000Z