Trending
Articles related to "Model Evaluation"
A/B testing could lead LLMs to retain users instead of helping them
LessWrong 2025-11-04T19:47:28.000000Z
Flawed AI benchmarks put enterprise budgets at risk
AI News 2025-11-04T15:25:46.000000Z
Is Your Agent Still Burning Money? Learn to Monitor with LangSmith and Make Every Token Count
Juejin AI 2025-11-04T00:11:34.000000Z
RepoMasterEval: Evaluating Code Completion via Real-World Repositories
cs.AI updates on arXiv.org 2025-11-03T05:20:09.000000Z
Language Model Preference Evaluation with Multiple Weak Evaluators
cs.AI updates on arXiv.org 2025-10-31T04:11:57.000000Z
SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth
cs.AI updates on arXiv.org 2025-10-31T04:05:51.000000Z
Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
cs.AI updates on arXiv.org 2025-10-31T04:02:45.000000Z
AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets
cs.AI updates on arXiv.org 2025-10-30T04:23:13.000000Z
How Do We Evaluate the Quality of LLMs' Mathematical Responses?
LessWrong 2025-10-29T09:14:28.000000Z
Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
cs.AI updates on arXiv.org 2025-10-29T04:28:19.000000Z
VisCoder2: Building Multi-Language Visualization Coding Agents
cs.AI updates on arXiv.org 2025-10-29T04:21:57.000000Z
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
cs.AI updates on arXiv.org 2025-10-28T04:14:33.000000Z
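Tackling AI's Inconsistent Sensitivity to Different Context Positions: A New Framework Lets "the One Who Tied the Bell Untie It"
BAAI Community 2025-10-27T17:39:43.000000Z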
Seedream 4.0 vs. Nano Banana and GPT-4o? EdiVal-Agent Settles Image-Editing Evaluation
Synced (机器之心) 2025-10-24T09:00:14.000000Z
A computational model and tool for generating more novel opportunities in professional innovation processes
cs.AI updates on arXiv.org 2025-10-24T04:18:11.000000Z
FairGen: Controlling Sensitive Attributes for Fair Generations in Diffusion Models via Adaptive Latent Guidance
cs.AI updates on arXiv.org 2025-10-23T04:45:45.000000Z
Probing Perceptual Constancy in Large Vision-Language Models
cs.AI updates on arXiv.org 2025-10-23T04:45:27.000000Z
Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series
cs.AI updates on arXiv.org 2025-10-23T04:22:43.000000Z
The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs
cs.AI updates on arXiv.org 2025-10-23T04:09:55.000000Z