Trending
Articles related to "evaluation tools"
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
cs.AI updates on arXiv.org 2025-10-20T04:14:06.000000Z
Braintrust is not an eval framework
Braintrust Blog 2025-10-02T12:52:11.000000Z
Five hard-learned lessons about AI evals
Braintrust Blog 2025-10-02T12:52:09.000000Z
Generative Value Conflicts Reveal LLM Priorities
cs.AI updates on arXiv.org 2025-10-01T06:00:13.000000Z
Kling 2.0: uncanny valley crossed — video creation will never be the same
Coding with Intelligence 2025-09-25T10:01:24.000000Z
HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking
cs.AI updates on arXiv.org 2025-09-16T05:39:20.000000Z
A new initiative for developing third-party model evaluations
Newsroom Anthropic 2025-09-13T01:26:42.000000Z
Google AI Introduces Stax: A Practical AI Tool for Evaluating Large Language Models LLMs
MarkTechPost@AI 2025-09-04T05:41:04.000000Z
RecUserSim: A Realistic and Diverse User Simulator for Evaluating Conversational Recommender Systems
cs.AI updates on arXiv.org 2025-08-01T04:08:18.000000Z
REAL: Benchmarking Abilities of Large Language Models for Housing Transactions and Services
cs.AI updates on arXiv.org 2025-07-08T04:33:43.000000Z
How Long Can Open-Source LLMs Truly Promise on Context Length?
2024-10-02T06:00:21.000000Z
Model evals for dangerous capabilities
LessWrong 2024-09-23T11:07:45.000000Z
Show HN: AIQuizgen - Transform your assessments with our AI quiz generator
buzz 2024-06-04T15:33:13.000000Z