The GitHub Blog · October 31, 06:00
GitHub MCP Server: Improving model tool use through offline evaluation

This article describes the offline evaluation pipeline for the GitHub MCP (Model Context Protocol) Server, which is designed to ensure that models select and use API tools accurately. By building datasets of user requests with their expected tools and arguments, and applying precise evaluation metrics (accuracy, precision, recall, F1-score, and argument-correctness measures), the GitHub team can systematically test the compatibility, quality, and potential issues of different model–MCP server pairings. The evaluation pipeline has three stages: fulfillment, evaluation, and summarization, with a focus on tool selection and argument correctness. The article also discusses future improvements, including growing the benchmark suite and evaluating multi-tool flows, to keep raising the performance and user experience of applications such as GitHub Copilot.

📝 **The core role of MCP (Model Context Protocol)**: MCP is a common standard for AI models to interact with APIs and services, allowing models to understand and use external tools. An MCP server publishes its list of available tools, what each tool does, and the parameters each one needs. The GitHub MCP Server underpins workflows such as GitHub Copilot, and how its tools are named, described, and parameterized directly affects how accurately models select them.

🛠️ **Why offline evaluation, and how it works**: To improve MCP tools without putting the user experience at risk, GitHub uses offline evaluation. The pipeline runs models against curated benchmark datasets, records their tool calls, and scores the results. It has three stages: fulfillment (run the model and record its tool calls), evaluation (compute the metrics), and summarization (produce the final report), catching regressions and keeping the feedback loop short.

🎯 **Key evaluation metrics**: The evaluation targets two things: tool-selection accuracy (measured with accuracy, precision, recall, F1-score, and a confusion matrix to see whether the model identifies and calls the right tools) and argument correctness (checked with metrics such as argument hallucination, all expected arguments provided, all required arguments provided, and exact value match).

📊 **Outlook and challenges**: The current evaluation framework works, but there is room to improve. The main challenges are growing the number of benchmarks to make the evaluation more reliable and extending it to handle multi-tool flows. By continuing to refine the evaluation, GitHub aims to improve the stability and user experience of the MCP Server so developers can work more efficiently.

MCP (Model Context Protocol) is a simple, common way for AI models (LLMs) to talk to APIs and data. Think of it like a universal plug: if both sides support MCP, they can connect and work together. An MCP server is any service or app that “speaks MCP” and offers tools the model can use, publishing a list of tools, what each tool does, and what inputs (parameters) each tool needs. 

The GitHub MCP Server is the foundation for many GitHub Copilot workflows, both inside and outside of GitHub. As an engineering team working on GitHub MCP, we’re always looking to deliver new features and functionality, while avoiding regressions and improving quality with every iteration. And how we name a tool, explain what it does, and spell out its parameters directly affects whether the model picks the right tool, in the right order, with the right arguments. 

When it comes to our work, small edits matter: tightening a description, adding or removing a tool, or combining a few similar tools can shift results a lot. When descriptions are off, agents choose the wrong tool, skip a step, send arguments in the wrong format, or drop them entirely. The outcome is a weak user experience. We need a safe way to change MCP and know if things actually got better, not worse. That’s where offline evaluation comes in.

Offline evaluation catches regressions before users see them and keeps the feedback loop short, so we can ship changes that genuinely improve performance.

This article walks through our evaluation pipeline and explains the metrics and algorithms that help us achieve these goals.

How automated offline evaluation works

Our offline evaluation pipeline checks how well our tool prompts work across different models. The tool instructions are kept simple and precise so the model can choose the right tool and fill in the correct parameters. Because LLMs vary in how they use tools, we systematically test each model–MCP pairing to measure compatibility, quality, and gaps.

Quick intro to MCP hosting

An MCP host or agent connects to one or more MCP servers, fetches their tool lists, and passes that information, along with the user’s request, to the LLM.

How it works (step by step):

1. The MCP server exposes tools (with names, descriptions, and required inputs).
2. The agent pulls that tool list and gives it to the model.
3. When the user asks something, the LLM decides if it needs a tool.
4. If yes, it picks a tool and fills in the inputs.
5. The agent calls the tool on the MCP server and returns the result to the LLM, which then forms the final answer.
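As a rough sketch of that loop in Python (the `ask_model` and `call_tool` functions here are hypothetical stand-ins for a real LLM API and MCP client, not actual SDK calls):

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ToolSpec:
    """A tool as advertised by an MCP server."""
    name: str
    description: str
    parameters: dict            # JSON-Schema-style description of the inputs

@dataclass
class ToolCall:
    name: str
    arguments: dict

def ask_model(request: str, tools: list[ToolSpec],
              tool_result: Optional[str] = None) -> Union[ToolCall, str]:
    """Hypothetical LLM call: returns either a ToolCall or the final answer text."""
    raise NotImplementedError   # stand-in for a real model API

def call_tool(call: ToolCall) -> str:
    """Hypothetical MCP client call: runs the tool on the server, returns its output."""
    raise NotImplementedError   # stand-in for a real MCP client

def handle_request(request: str, tools: list[ToolSpec]) -> Union[ToolCall, str]:
    # Steps 2-3: the agent hands the tool list plus the user's request to the model,
    # which decides whether a tool is needed at all.
    decision = ask_model(request, tools)
    if isinstance(decision, ToolCall):   # step 4: tool picked, arguments filled in
        result = call_tool(decision)     # step 5: execute on the MCP server
        decision = ask_model(request, tools, tool_result=result)  # final answer
    return decision                      # single-call sketch: no looping over tools
```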

We have curated datasets that we use as benchmarks. Every benchmark contains the following parameters: 

- Input: This is a user request formulated in natural language.
- Expected tools: Tools we expect to be called.
- Expected arguments: Arguments we expect to be passed to each tool.

Here are a few examples:

Asking how many issues were created in a given time period

Input:  How many issues were created in the github/github-mcp-server repository during April 2025? 
Expected tools: list_issues with arguments:

- owner: github
- repo: github-mcp-server
- since: 2025-04-01T00:00:00Z

Merging pull requests

Input: Merge PR 123 in github/docs using squash merge with title “Update installation guide”
Expected tools: merge_pull_request with arguments:

- owner: github
- repo: docs
- pullNumber: 123
- merge_method: squash
- commit_title: Update installation guide

Requesting code reviews

Input: Request reviews from alice456 and bob123 for PR 67 in team/project-alpha
Expected tools: update_pull_request with arguments: 

- owner: team
- repo: project-alpha
- pullNumber: 67
- reviewers: ["alice456", "bob123"]

Summarizing discussion comments

Input: Summarize the comments in discussion 33801, in the facebook/react repository 
Expected tools: get_discussion_comments with arguments:

- owner: facebook
- repo: react
- discussionNumber: 33801
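For illustration, the first benchmark above could be stored as a record like this (the field names are our own sketch, not necessarily the team’s actual schema):

```python
# One benchmark record as JSON-like Python data (field names are illustrative).
benchmark = {
    "input": "How many issues were created in the github/github-mcp-server "
             "repository during April 2025?",
    "expected_tools": ["list_issues"],
    "expected_arguments": {
        "list_issues": {
            "owner": "github",
            "repo": "github-mcp-server",
            "since": "2025-04-01T00:00:00Z",
        }
    },
}
```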

The evaluation pipeline has three stages: fulfillment (run the model against each benchmark and record the tool calls it makes), evaluation (score those calls against the expected tools and arguments), and summarization (roll the results up into a final report).
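A skeleton of how those stages could fit together (illustrative signatures, not the actual pipeline code):

```python
def fulfillment(benchmarks, run_model):
    """Stage 1: run the model on every benchmark and record the tool calls it makes.
    `run_model` is a hypothetical callable: request text -> list of (tool, arguments)."""
    return [{"benchmark": b, "calls": run_model(b["input"])} for b in benchmarks]

def evaluation(records):
    """Stage 2: score each record's calls against the benchmark's expected tools
    and arguments, producing the metrics described below."""
    ...

def summarization(per_benchmark_scores):
    """Stage 3: aggregate per-benchmark scores into the final per-tool report."""
    ...
```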

Evaluation metrics and algorithms

Our evaluation targets two aspects: whether the model selects the correct tools and whether it supplies correct arguments.

Tool selection

When benchmarks involve a single tool call, tool selection reduces to a multi-class classification problem. Each benchmark is labeled with the tool it expects, and each tool is a “class.”

Models tasked with this classification are evaluated using accuracy, precision, recall, and F1-score.

- Accuracy is the simplest measure: the percentage of correct classifications. In our case it means the percentage of inputs that resulted in the expected tool call. This is calculated over the whole dataset.
- Precision shows the proportion of cases where the tool was called correctly out of all cases where the tool was called. Low precision means the model picks the tool even in cases where it is not expected to be called.
- Recall shows the proportion of correctly called tools out of all cases where the given tool call was expected. Low recall may indicate that the model doesn’t understand that the tool needs to be called, and either fails to call it or calls another tool instead.
- F1-score is the harmonic mean of precision and recall, showing how well the model does on both at once.
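In terms of per-tool counts, these scores reduce to the standard formulas; here is a small sketch, where a true positive means the tool was both expected and called:

```python
def tool_scores(tp: int, fp: int, fn: int) -> dict:
    """Per-tool precision, recall, and F1 from raw counts.
    tp: benchmarks where this tool was expected and the model called it
    fp: benchmarks where the model called this tool although another was expected
    fn: benchmarks where this tool was expected but the model called something else
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

The worked example below plugs its counts straight into these formulas.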

If the model confuses two tools, it can result in low precision or recall for these tools.

We have two similar tools that used to be confused often: list_issues and search_issues. Say we have 10 benchmarks for list_issues and 10 benchmarks for search_issues. Imagine list_issues is called correctly in all 10 of its cases, and is also called in 30% of the cases where search_issues should be called.

This means we’re going to have lower recall for search_issues and lower precision for list_issues:

Precision (list_issues) = 10 (cases where the tool is called correctly) / (10 + 3 (cases where it is called instead of search_issues)) ≈ 0.77

Recall (search_issues) = 7 (cases where the tool was called correctly) / 10 (cases where it was expected to be called) = 0.7

To gain visibility into which tools are confused with each other, we build a confusion matrix. For the search_issues and list_issues tools from the example above, it would look like the following:

| Expected tool / Called tool | search_issues | list_issues |
| --- | --- | --- |
| search_issues | 7 | 3 |
| list_issues | 0 | 10 |

The confusion matrix allows us to see the reason behind low precision and recall for certain tools and tweak their descriptions to minimize confusion.
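As a sketch, the worked example above can be reproduced with scikit-learn, treating the expected tool as the true label and the called tool as the prediction:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

labels = ["search_issues", "list_issues"]

# Expected tool per benchmark: 10 benchmarks for each tool.
y_expected = ["search_issues"] * 10 + ["list_issues"] * 10
# Tool the model actually called: 3 of the search_issues benchmarks
# were answered with list_issues instead.
y_called = ["search_issues"] * 7 + ["list_issues"] * 3 + ["list_issues"] * 10

print(confusion_matrix(y_expected, y_called, labels=labels))
# [[ 7  3]
#  [ 0 10]]
print(precision_score(y_expected, y_called, labels=labels, average=None))
# approximately [1.0, 0.77]: precision(list_issues) = 10 / 13
print(recall_score(y_expected, y_called, labels=labels, average=None))
# approximately [0.7, 1.0]: recall(search_issues) = 7 / 10
```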

Argument correctness

Selecting the right tool isn’t enough. The model must also supply correct arguments. We’ve defined a set of argument-correctness metrics that pinpoint specific issues, making regressions easy to diagnose and fix.

We track four argument-quality metrics:

- Argument hallucination
- All expected arguments provided
- All required arguments provided
- Exact value match

These metrics are computed for tools that were correctly selected. The final report summarizes each tool’s performance across all four metrics.
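One way these checks could look for a single, correctly selected tool call (a sketch based on the metric names; the team’s exact definitions may differ):

```python
def argument_checks(called: dict, expected: dict,
                    required: set, allowed: set) -> dict:
    """Argument-quality checks for one correctly selected tool call.
    called:   arguments the model actually passed (name -> value)
    expected: arguments the benchmark expects (name -> value)
    required: parameter names the tool schema marks as required
    allowed:  every parameter name the tool schema accepts
    """
    return {
        # Did the model avoid inventing arguments the tool does not accept?
        "no_hallucinated_arguments": set(called) <= allowed,
        # Were all arguments the benchmark expects supplied?
        "all_expected_provided": set(expected) <= set(called),
        # Were all schema-required arguments supplied?
        "all_required_provided": required <= set(called),
        # Do the supplied values match the expected values exactly?
        "exact_value_match": all(called.get(k) == v for k, v in expected.items()),
    }
```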

Looking forward and filling the gaps

The current evaluation framework gives us a solid read on tool performance against curated datasets, but there’s still room to improve.

More is better

Benchmark volume is the weak point of offline evaluation. With so many classes (tools), we need more robust per-tool coverage; evaluations based on just a couple of examples per tool aren’t dependable on their own. Adding more benchmarks is always useful for increasing the reliability of the classification evaluation and the other metrics.

Evaluation of multi-tool flows

Our current pipeline handles only single tool calls. In practice, tools are often invoked sequentially, with later calls consuming the outputs of earlier ones. To evaluate these flows, we must go beyond fetching the MCP tool list and actually execute tool calls (or mock their responses) during evaluation.

We’ll also update summarization. Today we treat tool selection as multi-class classification, which assumes one tool per input. For flows where a single input can trigger multiple tools, multi-label classification is the better fit.

Take this with you

Offline evaluation gives us a fast, safe way to iterate on MCP, so models pick the right GitHub tools with the right arguments. By combining curated benchmarks with clear metrics—classification scores for tool selection and targeted checks for argument quality—we turn vague “it seems better” into measurable progress and actionable fixes.

We’re not stopping here. We’re expanding benchmark coverage, refining tool descriptions to reduce confusion, and extending the pipeline to handle real multi-tool flows with execution or faithful mocks. These investments mean fewer regressions, clearer insights, and more reliable agents that help developers move faster.

Most importantly, this work raises the bar for product quality without slowing delivery. As we grow the suite and deepen the evaluation, you can expect steadier improvements to GitHub MCP Server—and a better, more predictable experience for anyone building with it.

The post Measuring what matters: How offline evaluation of GitHub MCP Server works appeared first on The GitHub Blog.
