"Evaluation" related articles
FreeSliders: Training-Free, Modality-Agnostic Concept Sliders for Fine-Grained Diffusion Control in Images, Audio, and Video
cs.AI updates on arXiv.org
2025-11-05T05:18:58.000000Z
Webinar recap: Eval best practices
Braintrust Blog
2025-11-05T04:39:32.000000Z
Scalable Oversight via Partitioned Human Supervision
cs.AI updates on arXiv.org
2025-10-28T04:14:32.000000Z
PaperAsk: A Benchmark for Reliability Evaluation of LLMs in Paper Search and Reading
cs.AI updates on arXiv.org
2025-10-28T04:14:09.000000Z
Harnessing the Power of Large Language Models for Software Testing Education: A Focus on ISTQB Syllabus
cs.AI updates on arXiv.org
2025-10-28T04:14:09.000000Z
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
cs.AI updates on arXiv.org
2025-10-28T04:11:10.000000Z
Learning "Partner-Aware" Collaborators in Multi-Party Collaboration
cs.AI updates on arXiv.org
2025-10-28T04:02:55.000000Z
List of lists of project ideas in AI Safety
LessWrong
2025-10-27T08:42:17.000000Z
How to Write Good AI Prompts?
Juejin AI
2025-10-24T19:00:52.000000Z
Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
cs.AI updates on arXiv.org
2025-10-24T04:27:00.000000Z
WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
cs.AI updates on arXiv.org
2025-10-22T04:23:52.000000Z
Are LLMs Court-Ready? Evaluating Frontier Models on Indian Legal Reasoning
cs.AI updates on arXiv.org
2025-10-22T04:17:01.000000Z
Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
cs.AI updates on arXiv.org
2025-10-21T04:18:41.000000Z
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
cs.AI updates on arXiv.org
2025-10-20T04:14:11.000000Z
StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation
cs.AI updates on arXiv.org
2025-10-16T04:26:14.000000Z
Scheming Ability in LLM-to-LLM Strategic Interactions
cs.AI updates on arXiv.org
2025-10-16T04:23:05.000000Z
Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation
cs.AI updates on arXiv.org
2025-10-15T04:34:54.000000Z
Andrew Ng's New Agentic AI Course: A Hands-On Guide to Building Agent Workflows, Where GPT-3.5 Beating GPT-4 Comes Easily
QbitAI
2025-10-14T09:14:35.000000Z
How to Evaluate Your RAG Pipeline with Synthetic Data?
MarkTechPost@AI
2025-10-13T21:33:59.000000Z
Objective Features Extracted from Motor Activity Time Series for Food Addiction Analysis Using Machine Learning - A Pilot Study
cs.AI updates on arXiv.org
2025-10-10T04:20:58.000000Z