Benchmarks_Fishai

Articles related to "Benchmarks"

OpenAI, Amazon, and $38B

The Rundown AI - Daily Picks 2025-11-05T06:51:27.000000Z

Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models

cs.AI updates on arXiv.org 2025-11-05T05:31:41.000000Z

Towards Robust Mathematical Reasoning

cs.AI updates on arXiv.org 2025-11-05T05:31:07.000000Z

AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence

cs.AI updates on arXiv.org 2025-11-05T05:30:01.000000Z

FedOnco-Bench: A Reproducible Benchmark for Privacy-Aware Federated Tumor Segmentation with Synthetic CT Data

cs.AI updates on arXiv.org 2025-11-05T05:27:02.000000Z

Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities

cs.AI updates on arXiv.org 2025-11-05T05:13:50.000000Z

QuantumBench: A Benchmark for Quantum Problem Solving

cs.AI updates on arXiv.org 2025-11-05T05:13:06.000000Z

OSWorld-MCP officially released: an evaluation benchmark for Computer-Use Agents products

oschina.net 2025-11-04T09:45:13.000000Z

360 open-sources its vision-language alignment model FG-CLIP2

oschina.net 2025-11-03T10:19:37.000000Z

RepoMasterEval: Evaluating Code Completion via Real-World Repositories

cs.AI updates on arXiv.org 2025-11-03T05:20:09.000000Z

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

cs.AI updates on arXiv.org 2025-11-03T05:19:51.000000Z

LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval

cs.AI updates on arXiv.org 2025-11-03T05:18:59.000000Z

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

cs.AI updates on arXiv.org 2025-11-03T05:18:49.000000Z

Qwen3-Max, Alibaba Tongyi Qianwen's most capable language model, now offers a deep thinking feature on its official website

IT之家 2025-11-02T14:14:10.000000Z

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

cs.AI updates on arXiv.org 2025-10-31T04:09:32.000000Z

Questionnaire meets LLM: A Benchmark and Empirical Study of Structural Skills for Understanding Questions and Responses

cs.AI updates on arXiv.org 2025-10-31T04:02:21.000000Z

Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4

cs.AI updates on arXiv.org 2025-10-31T04:01:55.000000Z

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

cs.AI updates on arXiv.org 2025-10-30T04:21:16.000000Z

Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

cs.AI updates on arXiv.org 2025-10-30T04:21:10.000000Z

Beyond Function-Level Search: Repository-Aware Dual-Encoder Code Retrieval with Adversarial Verification

cs.AI updates on arXiv.org 2025-10-30T04:14:15.000000Z
