Benchmark_Fishai


Articles related to "Benchmark"

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

cs.AI updates on arXiv.org 2025-11-03T05:20:14.000000Z

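An ImageNet moment for high-dimensional time series forecasting! The first high-dimensional time series forecasting benchmark is released, with a model leading SOTA across multiple datasets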

BAAI Community (智源社区) 2025-10-29T16:23:53.000000Z

ChessQA: Evaluating Large Language Models for Chess Understanding

cs.AI updates on arXiv.org 2025-10-29T04:24:43.000000Z

DAG-Math: Graph-Guided Mathematical Reasoning in LLMs

cs.AI updates on arXiv.org 2025-10-24T04:16:01.000000Z

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

cs.AI updates on arXiv.org 2025-10-21T04:24:01.000000Z

OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models

cs.AI updates on arXiv.org 2025-10-21T04:22:57.000000Z

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios

cs.AI updates on arXiv.org 2025-10-20T04:14:11.000000Z

CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models

cs.AI updates on arXiv.org 2025-10-16T04:24:52.000000Z

MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

cs.AI updates on arXiv.org 2025-10-15T04:38:56.000000Z

TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

cs.AI updates on arXiv.org 2025-10-13T04:09:36.000000Z

Leveraging Whisper Embeddings for Audio-based Lyrics Matching

cs.AI updates on arXiv.org 2025-10-10T04:16:35.000000Z

Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

cs.AI updates on arXiv.org 2025-10-09T04:13:30.000000Z

VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

cs.AI updates on arXiv.org 2025-10-09T04:06:31.000000Z

BanglaLlama: LLaMA for Bangla Language

cs.AI updates on arXiv.org 2025-10-08T04:15:31.000000Z

Kaputt: A Large-Scale Dataset for Visual Defect Detection

cs.AI updates on arXiv.org 2025-10-08T04:14:41.000000Z

FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning

cs.AI updates on arXiv.org 2025-10-07T04:07:24.000000Z

Emergent evaluation hubs in a decentralizing large language model ecosystem

cs.AI updates on arXiv.org 2025-10-03T04:13:17.000000Z

DexBench: Benchmarking LLMs for Personalized Decision Making in Diabetes Management

cs.AI updates on arXiv.org 2025-10-02T04:16:50.000000Z

Deconstructing Self-Bias in LLM-generated Translation Benchmarks

cs.AI updates on arXiv.org 2025-10-01T06:02:04.000000Z

A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability

cs.AI updates on arXiv.org 2025-09-30T04:08:37.000000Z
