Braintrust Blog · October 11, 07:05
Evaluating AI Systems: Building a Trustworthy Iteration Process

This article examines the randomness challenge that AI systems, especially large language models (LLMs), face during development: how can you be confident an improvement is real rather than chance? The author stresses the importance of systematic evaluations (evals), which provide statistical confidence, catch regressions before users notice, and drive continuous improvement. The article walks through the three core components of an eval: the task, the dataset, and the scores, with concrete code examples. It also emphasizes defining success metrics up front, layering evaluations from end-to-end down to single steps, writing effective scoring rubrics (one dimension per score, binary scores, harsh thresholds, production feedback), and feeding production data back into the development process. The end goal is systematic improvement of AI product quality instead of building "on vibes."

🎯 **The randomness challenge and why evals matter**: The non-deterministic output of AI systems (especially LLMs) makes it hard to verify whether a change actually improved anything. Without systematic evaluation, teams build on vibes, cannot track model performance, and cannot effectively integrate user feedback. A working eval process provides statistical confidence, surfaces problems early, and supports continuous improvement.

🛠️ **The core components of an eval**: A complete eval has three parts. 1. **Task**: the AI system being tested and iterated on, anything from a simple system prompt to a complex agent or multi-step workflow; the only requirement is that it takes an input and produces an output. 2. **Data**: examples of real user interactions; start small to expose weaknesses quickly, then keep expanding with production feedback until it becomes a comprehensive test suite. 3. **Scores**: the measurements you care about, either code scorers (length, pattern checks) or an LLM acting as judge for subjective qualities (tone, helpfulness), each returning a value between 0 and 1 for easy analysis.

📈 **End-to-end vs. step-level evals and scoring strategy**: Start with end-to-end evals that verify the overall goal is met, then add single-step or intermediate evals as the system grows more complex so you can pinpoint root causes precisely. For scoring, measure one dimension per score, start binary (0 or 1), set harsh thresholds to keep the bar high, and turn production feedback (user-reported issues) into dedicated score dimensions that target real-world failure modes.

🔄 **Connecting production and development into a closed loop**: The most powerful eval strategy feeds production data back into development. Automatic logging, failure capture, one-click conversion to test cases, rapid iteration, and quality gates in CI/CD create a fast inner feedback loop (eval → iterate → eval) and a broader outer loop (production → datasets → evals → production), so the AI product keeps improving under real-world use and quality gains land faster and more reliably.

10 October 2025 · Carlos Esteban

This post is based on an online event on October 7, 2025. You can watch the full replay on YouTube.

AI systems are non-deterministic. Give an LLM the same question twice, and you'll get different answers. This randomness creates a fundamental problem: how do you know your changes improved quality instead of just shifting failure modes around?

Without a systematic evaluation process, teams build on vibes. You ship blind, fix one issue only to break another, and wonder which model performs best. You capture user feedback but can't integrate it into development. The questions pile up: Which LLM should I use? How do I catch hallucinations? When something breaks, where do I look?

Evals answer these questions. They give you statistical confidence that your changes work, catch regressions before users do, and create a process for continuous improvement. Here's how to build them.

Every eval in Braintrust follows this structure:

```python
from autoevals import Levenshtein
from braintrust import Eval

Eval(
    "My Project",
    data=lambda: [
        {"input": "Hello", "expected": "Hi"},
        {"input": "Goodbye", "expected": "Bye"},
    ],
    task=lambda input: call_my_llm(input),
    scores=[Levenshtein],
)
```

Let's break down each component:

The task is your subject of evaluation -- the thing you're testing and iterating on. Start simple with a system prompt, but scale to whatever complexity you need: a chain of prompts, a full agent, or a multi-step workflow. The only requirement is that your task takes an input and produces an output.

For example, you might start with a basic prompt like "You are a helpful customer support agent" and test how it handles common questions. As you identify issues, you'll iterate: add retrieval for knowledge base articles, adjust the tone to be more empathetic, or incorporate tools for order lookups. The eval process shows you how each change affects your scores downstream.
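As a rough sketch (assuming the OpenAI Python SDK, and treating `call_my_llm` from the earlier snippet as your own helper), a starting task might look like this:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical task: the "call_my_llm" referenced in the Eval above.
# It starts as a single system prompt; later iterations might add retrieved
# knowledge base articles or tool definitions before making the call.
def call_my_llm(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model works here
        messages=[
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content
```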

Your dataset contains examples of how end users actually interact with your AI application. This isn't hypothetical -- it's real questions, real edge cases, real failure modes you didn't anticipate when you started.

Start small. Five to ten examples representing your main persona is enough to get started. The goal is reaching the eval phase fast, because evals expose weaknesses and drive actions that improve performance. As you gather production feedback and identify issues, feed those examples back into your dataset. Over time, this scales to hundreds of cases covering the full range of user interactions.

This becomes your golden dataset -- the comprehensive test suite you run before any production deployment. It catches regressions across all the ways users interact with your product.
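While the dataset is still small, it can be as simple as a list of input/expected pairs; the rows below are hypothetical stand-ins for the real production examples you would collect:

```python
# A small starter dataset: real user questions plus the behavior you expect.
# These rows are hypothetical; in practice they come from logs and user reports.
golden_dataset = [
    {"input": "Where is my order #1042?", "expected": "Ask for the order email, then share tracking."},
    {"input": "How do I reset my password?", "expected": "Link to the reset flow; no apology needed."},
    {"input": "Do you ship to Canada?", "expected": "Yes, with a 5-10 business day estimate."},
]

def load_data():
    # Matches the shape of the data= argument in the Eval example above.
    return golden_dataset
```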

Scores define what you care about and what you want to measure. Unlike unit tests with strict pass/fail criteria, eval scores return values between 0 and 1 -- a percentage that gives you flexibility in analysis and insight gathering.

Start with code scores. These are deterministic, cheap, and familiar to engineering teams. They're perfect for quantitative checks:

```python
import re


def conciseness_scorer(output: str, **kwargs) -> int:
    # Penalize responses over 200 tokens
    return 1 if len(output.split()) <= 200 else 0


def has_quotes_scorer(output: str, **kwargs) -> int:
    # Check for quotation marks (not apostrophes)
    return 1 if re.search(r'[""]', output) else 0
```

But code scores can't capture everything. For subjective qualities like tone, helpfulness, or factual accuracy, you need LLM judges:

```python
project.scorers.create(
    name="Tone scorer",
    messages=[
        {
            "role": "user",
            "content": """Does the response maintain a professional yet empathetic tone?

Response: {{output}}

Choose:
A) Professional and empathetic
B) Professional but cold
C) Unprofessional""",
        }
    ],
    model="gpt-4o",
    choice_scores={"A": 1, "B": 0.5, "C": 0},
)
```

The key is aligning your LLM judge to human judgment through careful prompt design and validation.
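One lightweight way to check that alignment (a plain-Python sketch, not a Braintrust API) is to score a handful of responses with both the judge and a human reviewer and compare:

```python
# Hypothetical alignment check: compare LLM-judge scores against human labels
# on the same responses before trusting the judge at scale.
judge_scores = [1.0, 0.5, 0.0, 1.0, 0.5]   # from the tone scorer
human_scores = [1.0, 0.5, 0.5, 1.0, 0.5]   # from a human reviewer

agreement = sum(j == h for j, h in zip(judge_scores, human_scores)) / len(judge_scores)
print(f"Judge/human agreement: {agreement:.0%}")  # disagreements point at prompt fixes
```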

Think of scores as the backbone of your eval. Each score should measure one dimension on a 0-1 scale, giving you clear signal on specific aspects of your AI's performance.

The biggest mistake teams make is kicking off an MVP without articulating requirements. This leads to going back to the whiteboard and restarting the process. Instead, define what success looks like upfront. Be specific about your success metrics and what you want to achieve.

This clarity creates an improvement target. When you know what you want, the path to get there becomes obvious. It also bridges the product-engineering gap. Product teams excel at defining success criteria (it's essentially a PRD) -- for example, "responses must include relevant citations," "tone should be empathetic but not apologetic," or "answers must address all parts of multi-part questions." Engineers then implement the evaluation infrastructure and ensure production data flows into the eval loop.
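Those criteria map directly onto scorers. Here's a hedged sketch of the citation requirement as a code score; the URL and `[1]`-style patterns are assumptions about what counts as a citation:

```python
import re


def citation_scorer(output: str, **kwargs) -> int:
    # Success criterion from the PRD: "responses must include relevant citations".
    # Assumption: a citation is a URL or a bracketed source reference like [1].
    has_url = bool(re.search(r"https?://\S+", output))
    has_bracket_ref = bool(re.search(r"\[\d+\]", output))
    return 1 if (has_url or has_bracket_ref) else 0
```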

Begin with end-to-end evals where your agent or workflow runs completely, and you evaluate the final output. Ask: Did the agent accomplish its goal? This broad evaluation solves many problems without complexity.

As your system grows more complex with intermediate steps, you'll need single-turn or step-by-step evals. Was this the right tool call? Were the arguments correct? By evaluating each step along the way, you pinpoint what caused errors and fix the root cause instead of guessing.

For example, in a RAG use case, evaluate the retrieval step separately. Did it fetch the right documents? Were they relevant? This granular approach reveals exactly where your pipeline breaks down.
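Here's a sketch of what a retrieval-step score could look like, assuming each dataset row is annotated with the document ids that should have been fetched:

```python
def retrieval_recall_scorer(output: dict, expected: dict, **kwargs) -> float:
    # Hypothetical shapes: output["retrieved_ids"] from the retrieval step,
    # expected["relevant_ids"] annotated on the dataset row.
    retrieved = set(output.get("retrieved_ids", []))
    relevant = set(expected.get("relevant_ids", []))
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    # Fraction of the required documents that were actually fetched.
    return len(retrieved & relevant) / len(relevant)
```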

Follow these guidelines for scoring functions that actually work:

    One dimension per score: Don't try to measure factuality, conciseness, and tone in a single score. It confuses the LLM and makes human review harder. If you're checking five criteria, create five separate scores.

    Start binary: Use 0 or 1 initially. Does the response meet the threshold? Yes (1) or no (0). This makes review straightforward and alignment clear.

    Set harsh thresholds: If you're passing 100% of your evals, you're not setting the bar high enough. Low scores are okay -- they're a sign you're measuring what actually matters. Make your evals fail initially, then improve your way to higher scores.

    Use production feedback: When you spot patterns in your logs -- repeated failures, edge cases, user complaints -- create dedicated scores for them. Each production issue should become its own dimension to measure and prevent.

For example, say your application struggles to use a calculator tool when it should. You might have only one test case requiring calculator use, but production shows it's a common failure. Add 5-10 targeted examples that specifically require calculator use, then create a scorer to test tool usage. Similarly, if users report your chatbot keeps apologizing unnecessarily, create a scorer that penalizes responses containing "I'm sorry" or similar phrases.
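Both of those production-driven checks are small code scorers. A sketch, with the apology phrase list and the tool-call shape as assumptions:

```python
def no_apology_scorer(output: str, **kwargs) -> int:
    # Penalize the unnecessary-apology failure mode reported by users.
    apologies = ["i'm sorry", "i am sorry", "my apologies", "we apologize"]
    return 0 if any(phrase in output.lower() for phrase in apologies) else 1


def calculator_used_scorer(output: dict, **kwargs) -> int:
    # Assumes the task returns the trace of tool calls it made.
    tool_calls = output.get("tool_calls", [])
    return 1 if any(call.get("name") == "calculator" for call in tool_calls) else 0
```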

A support chatbot might ultimately have separate scores for factuality (does it answer correctly?), tone (is it helpful and friendly?), conciseness (under X tokens?), and format compliance (follows the template?). Each dimension gets its own score.

The most powerful eval strategy connects production data back to your development process. Here's how it works:

    Production: Your AI app runs with automatic logging

    Capture failures: Users report issues or you spot patterns in monitoring

    One-click test cases: Convert production traces to dataset rows instantly

    Rapid iteration: Test fixes in the playground, compare side-by-side

    Quality gates: Run evals in CI/CD to verify fixes don't break other cases

    Deploy with confidence: See quality scores on every commit

    Continuous monitoring: Production traces feed back into your eval suite automatically

This creates two feedback loops: a fast inner loop (evals → iterate → evals) where you rapidly test prompt changes, and a larger outer loop (production → datasets → evals → production) that ensures real-world usage continuously improves your product.

Teams with this process in place adopt new models within 24 hours of release. When a user complains, that interaction becomes a test case immediately. Feature ideas get validated with evals before shipping to users.

Here's what successful teams do:

Start simple, but start now. One score with 5-10 test cases is enough. You can expand from there. The friction is starting, so lower the barrier as much as possible.

Make evals part of your workflow. When a user complains in any channel, add it to a dataset. When a new model drops, run your evals immediately. When you think of a new feature, validate it with evals first. Invest in the process so these actions become automatic.
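For example, when a new model drops, you can rerun the same eval with the task bound to each candidate model. This is a sketch that reuses the hypothetical helpers above and assumes `call_my_llm` accepts a model argument:

```python
from functools import partial

from autoevals import Levenshtein
from braintrust import Eval


def run_task(input: str, model: str) -> str:
    # Same prompt and dataset, different model per run.
    return call_my_llm(input, model=model)  # assumption: call_my_llm takes a model arg


for model in ["gpt-4o", "gpt-4o-mini"]:
    Eval(
        "My Project",
        data=load_data,                    # the golden dataset from earlier
        task=partial(run_task, model=model),
        scores=[Levenshtein],
    )
```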

Integrate CI/CD. Automate your evals so every pull request shows quality scores before merge. Set up quality gates that block degraded prompts from reaching production:

```
# Run evals on every commit
npx braintrust eval my-evals/

# Or use the GitHub action
- name: Run Evals
  uses: braintrustdata/eval-action@v1
  with:
    api_key: ${{ secrets.BRAINTRUST_API_KEY }}
    runtime: node
```

Make the eval loop as frictionless as possible.

Optimize the whole system. Don't just tweak prompts. Evaluate intermediate steps, tool outputs, retrieval quality, context formatting. Everything that influences the final response deserves measurement and optimization. For instance, in a RAG system, switching a tool's output format from JSON to YAML might double its success rate because YAML is shorter, easier for the model to parse, and uses fewer tokens. These context optimizations often move the needle more than prompt tweaks alone.
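As a quick illustration of that formatting choice (assuming PyYAML is available, and with a made-up tool result), here's the same payload rendered both ways:

```python
import json

import yaml  # assumption: PyYAML is installed

order = {"order_id": 1042, "status": "shipped", "items": [{"sku": "A1", "qty": 2}]}

# The same tool result, two context formats. The YAML form is shorter and
# often easier for the model to read, which can lift downstream scores.
as_json = json.dumps(order, indent=2)
as_yaml = yaml.dump(order, sort_keys=False)
print(as_json)
print(as_yaml)
```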

Embrace imperfection. Evals are hard. Your scores will be approximations, and that's fine. The goal isn't perfect measurement -- it's systematic improvement as you iterate. Data can be noisy, scores can fluctuate, and you'll still ship better products than teams building on vibes.

With evals in place, you'll shift from hoping your changes work to knowing they work. You'll catch hallucinations systematically instead of discovering them in user complaints. You'll compare models based on real performance data instead of marketing claims. You'll turn every production failure into a test case that prevents future regressions.

The complete development loop -- from production traces to evals and back -- is how leading AI teams ship verified quality improvements instead of building on vibes. It's how Notion increased from fixing 3 issues per day to 30. It's how Zapier improved AI products from sub-50% accuracy to 90%+ within 2-3 months.

Start simple. Build the process. Measure as you iterate. That's how you ship AI products with confidence.


To learn more about Braintrust, sign up or book a demo.
