Articles related to "Benchmarking"
OpenAI, Amazon, and $38B
The Rundown AI - Daily Picks 2025-11-05T06:51:27.000000Z
Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models
cs.AI updates on arXiv.org 2025-11-05T05:31:41.000000Z
Towards Robust Mathematical Reasoning
cs.AI updates on arXiv.org 2025-11-05T05:31:07.000000Z
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
cs.AI updates on arXiv.org 2025-11-05T05:30:01.000000Z
FedOnco-Bench: A Reproducible Benchmark for Privacy-Aware Federated Tumor Segmentation with Synthetic CT Data
cs.AI updates on arXiv.org 2025-11-05T05:27:02.000000Z
Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
cs.AI updates on arXiv.org 2025-11-05T05:13:50.000000Z
QuantumBench: A Benchmark for Quantum Problem Solving
cs.AI updates on arXiv.org 2025-11-05T05:13:06.000000Z
OSWorld-MCP Officially Released: An Evaluation Benchmark for Computer-Use Agents Products
oschina.net 2025-11-04T09:45:13.000000Z
360 Open-Sources Vision-Language Alignment Model FG-CLIP2
oschina.net 2025-11-03T10:19:37.000000Z
RepoMasterEval: Evaluating Code Completion via Real-World Repositories
cs.AI updates on arXiv.org 2025-11-03T05:20:09.000000Z
CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments
cs.AI updates on arXiv.org 2025-11-03T05:19:51.000000Z
LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval
cs.AI updates on arXiv.org 2025-11-03T05:18:59.000000Z
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
cs.AI updates on arXiv.org 2025-11-03T05:18:49.000000Z
Qwen3-Max, Alibaba Tongyi Qianwen's Most Capable Language Model, Launches Deep Thinking Feature on Its Official Website
IT之家 2025-11-02T14:14:10.000000Z
AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
cs.AI updates on arXiv.org 2025-10-31T04:09:32.000000Z
Questionnaire meets LLM: A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses
cs.AI updates on arXiv.org 2025-10-31T04:02:21.000000Z
Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
cs.AI updates on arXiv.org 2025-10-31T04:01:55.000000Z
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
cs.AI updates on arXiv.org 2025-10-30T04:21:16.000000Z
Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents
cs.AI updates on arXiv.org 2025-10-30T04:21:10.000000Z
Beyond Function-Level Search: Repository-Aware Dual-Encoder Code Retrieval with Adversarial Verification
cs.AI updates on arXiv.org 2025-10-30T04:14:15.000000Z