The GitHub Blog · October 31, 06:00
GitHub MCP Server: Improving model tool use through offline evaluation

This article describes the offline evaluation pipeline for the GitHub MCP (Model Context Protocol) Server, which is designed to ensure that models select and use API tools accurately. By building datasets of user requests with their expected tools and arguments, and applying precise evaluation metrics (accuracy, precision, recall, F1-score, and argument-correctness measures), the GitHub team can systematically test the compatibility, quality, and potential issues of different model–MCP server pairings. The evaluation pipeline has three stages: fulfillment, evaluation, and summarization, with a focus on tool selection and argument correctness. The article also discusses future improvements, including growing the benchmark suite and evaluating multi-tool flows, to keep raising the performance and user experience of applications such as GitHub Copilot.

📝 **The core role of MCP (Model Context Protocol)**: MCP is a common standard for AI models to interact with APIs and services, allowing models to understand and use external tools. An MCP server publishes its list of available tools, what each tool does, and the parameters each one needs. The GitHub MCP Server underpins workflows such as GitHub Copilot, and how its tools are named, described, and parameterized directly affects how accurately models select them.

🛠️ **Why offline evaluation, and how it works**: To improve MCP tools without putting the user experience at risk, GitHub uses offline evaluation. The pipeline runs models against curated benchmark datasets, records their tool calls, and scores the results. It has three stages: fulfillment (run the model and record its tool calls), evaluation (compute the metrics), and summarization (produce the final report), catching regressions and keeping the feedback loop short.

🎯 **Key evaluation metrics**: The evaluation targets two things: tool-selection accuracy (measured with accuracy, precision, recall, F1-score, and a confusion matrix to see whether the model identifies and calls the right tools) and argument correctness (checked with metrics such as argument hallucination, all expected arguments provided, all required arguments provided, and exact value match).

📊 **Outlook and challenges**: The current evaluation framework works, but there is room to improve. The main challenges are growing the number of benchmarks to make the evaluation more reliable and extending it to handle multi-tool flows. By continuing to refine the evaluation, GitHub aims to improve the stability and user experience of the MCP Server so developers can work more efficiently.

MCP (Model Context Protocol) is a simple, common way for AI models (LLMs) to talk to APIs and data. Think of it like a universal plug: if both sides support MCP, they can connect and work together. An MCP server is any service or app that “speaks MCP” and offers tools the model can use, publishing a list of tools, what each tool does, and what inputs (parameters) each tool needs. 

The GitHub MCP Server is the foundation for many GitHub Copilot workflows, both inside and outside of GitHub. As an engineering team working on GitHub MCP, we’re always looking to deliver new features and functionality, while avoiding regressions and improving quality with every iteration. And how we name a tool, explain what it does, and spell out its parameters directly affects whether the model picks the right tool, in the right order, with the right arguments. 

When it comes to our work, small edits matter: tightening a description, adding or removing a tool, or combining a few similar tools can shift results a lot. When descriptions are off, agents choose the wrong tool, skip a step, send arguments in the wrong format, or drop them entirely. The outcome is a weak user experience. We need a safe way to change MCP and know if things actually got better, not worse. That’s where offline evaluation comes in.

Offline evaluation catches regressions before users see them and keeps the feedback loop short, so we can ship changes that genuinely improve performance.

This article walks through our evaluation pipeline and explains the metrics and algorithms that help us achieve these goals.

How automated offline evaluation works

Our offline evaluation pipeline checks how well our tool prompts work across different models. The tool instructions are kept simple and precise so the model can choose the right tool and fill in the correct parameters. Because LLMs vary in how they use tools, we systematically test each model–MCP pairing to measure compatibility, quality, and gaps.

Quick intro to MCP hosting

An MCP host or agent connects to one or more MCP servers, fetches their tool lists, and passes that information, along with the user’s request, to the LLM.

How it works (step by step):

1. The MCP server exposes tools (with names, descriptions, and required inputs).
2. The agent pulls that tool list and gives it to the model.
3. When the user asks something, the LLM decides if it needs a tool.
4. If yes, it picks a tool and fills in the inputs.
5. The agent calls the tool on the MCP server and returns the result to the LLM, which then forms the final answer.
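As a rough sketch of that loop in Python (the `ask_model` and `call_tool` functions here are hypothetical stand-ins for a real LLM API and MCP client, not actual SDK calls):

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ToolSpec:
    """A tool as advertised by an MCP server."""
    name: str
    description: str
    parameters: dict            # JSON-Schema-style description of the inputs

@dataclass
class ToolCall:
    name: str
    arguments: dict

def ask_model(request: str, tools: list[ToolSpec],
              tool_result: Optional[str] = None) -> Union[ToolCall, str]:
    """Hypothetical LLM call: returns either a ToolCall or the final answer text."""
    raise NotImplementedError   # stand-in for a real model API

def call_tool(call: ToolCall) -> str:
    """Hypothetical MCP client call: runs the tool on the server, returns its output."""
    raise NotImplementedError   # stand-in for a real MCP client

def handle_request(request: str, tools: list[ToolSpec]) -> Union[ToolCall, str]:
    # Steps 2-3: the agent hands the tool list plus the user's request to the model,
    # which decides whether a tool is needed at all.
    decision = ask_model(request, tools)
    if isinstance(decision, ToolCall):   # step 4: tool picked, arguments filled in
        result = call_tool(decision)     # step 5: execute on the MCP server
        decision = ask_model(request, tools, tool_result=result)  # final answer
    return decision                      # single-call sketch: no looping over tools
```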

We have curated datasets that we use as benchmarks. Every benchmark contains the following parameters: 

- Input: This is a user request formulated in natural language.
- Expected tools: Tools we expect to be called.
- Expected arguments: Arguments we expect to be passed to each tool.

Here are a few examples:

Asking how many issues were created in a given time period

Input:  How many issues were created in the github/github-mcp-server repository during April 2025? 
Expected tools: list_issues with arguments:

- owner: github
- repo: github-mcp-server
- since: 2025-04-01T00:00:00Z

Merging pull requests

Input: Merge PR 123 in github/docs using squash merge with title “Update installation guide”
Expected tools: merge_pull_request with arguments:

- owner: github
- repo: docs
- pullNumber: 123
- merge_method: squash
- commit_title: Update installation guide

Requesting code reviews

Input: Request reviews from alice456 and bob123 for PR 67 in team/project-alpha
Expected tools: update_pull_request with arguments: 

- owner: team
- repo: project-alpha
- pullNumber: 67
- reviewers: ["alice456", "bob123"]

Summarizing discussion comments

Input: Summarize the comments in discussion 33801, in the facebook/react repository 
Expected tools: get_discussion_comments with arguments:

- owner: facebook
- repo: react
- discussionNumber: 33801
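For illustration, the first benchmark above could be stored as a record like this (the field names are our own sketch, not necessarily the team’s actual schema):

```python
# One benchmark record as JSON-like Python data (field names are illustrative).
benchmark = {
    "input": "How many issues were created in the github/github-mcp-server "
             "repository during April 2025?",
    "expected_tools": ["list_issues"],
    "expected_arguments": {
        "list_issues": {
            "owner": "github",
            "repo": "github-mcp-server",
            "since": "2025-04-01T00:00:00Z",
        }
    },
}
```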

The evaluation pipeline has three stages: fulfillment (run the model against each benchmark and record the tool calls it makes), evaluation (score those calls against the expected tools and arguments), and summarization (roll the results up into a final report).
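A skeleton of how those stages could fit together (illustrative signatures, not the actual pipeline code):

```python
def fulfillment(benchmarks, run_model):
    """Stage 1: run the model on every benchmark and record the tool calls it makes.
    `run_model` is a hypothetical callable: request text -> list of (tool, arguments)."""
    return [{"benchmark": b, "calls": run_model(b["input"])} for b in benchmarks]

def evaluation(records):
    """Stage 2: score each record's calls against the benchmark's expected tools
    and arguments, producing the metrics described below."""
    ...

def summarization(per_benchmark_scores):
    """Stage 3: aggregate per-benchmark scores into the final per-tool report."""
    ...
```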

Evaluation metrics and algorithms

Our evaluation targets two aspects: whether the model selects the correct tools and whether it supplies correct arguments.

Tool selection

When benchmarks involve a single tool call, tool selection reduces to a multi-class classification problem. Each benchmark is labeled with the tool it expects, and each tool is a “class.”

Models tasked with this classification are evaluated using accuracy, precision, recall, and F1-score.

- Accuracy is the simplest measure: the percentage of correct classifications. In our case it means the percentage of inputs that resulted in the expected tool call. This is calculated over the whole dataset.
- Precision shows the proportion of cases where the tool was called correctly out of all cases where the tool was called. Low precision means the model picks the tool even in cases where it is not expected to be called.
- Recall shows the proportion of correctly called tools out of all cases where the given tool call was expected. Low recall may indicate that the model doesn’t understand that the tool needs to be called, and either fails to call it or calls another tool instead.
- F1-score is the harmonic mean of precision and recall, showing how well the model does on both at once.
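In terms of per-tool counts, these scores reduce to the standard formulas; here is a small sketch, where a true positive means the tool was both expected and called:

```python
def tool_scores(tp: int, fp: int, fn: int) -> dict:
    """Per-tool precision, recall, and F1 from raw counts.
    tp: benchmarks where this tool was expected and the model called it
    fp: benchmarks where the model called this tool although another was expected
    fn: benchmarks where this tool was expected but the model called something else
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

The worked example below plugs its counts straight into these formulas.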

If the model confuses two tools, it can result in low precision or recall for these tools.

We have two similar tools that used to be confused often: list_issues and search_issues. Say we have 10 benchmarks for list_issues and 10 benchmarks for search_issues. Imagine list_issues is called correctly in all 10 of its cases, and is also called in 30% of the cases where search_issues should be called.

This means we’re going to have lower recall for search_issues and lower precision for list_issues:

Precision (list_issues) = 10 (cases where the tool is called correctly) / (10 + 3 (cases where it is called instead of search_issues)) ≈ 0.77

Recall (search_issues) = 7 (cases where the tool was called correctly) / 10 (cases where it was expected to be called) = 0.7

To gain visibility into which tools are confused with each other, we build a confusion matrix. For the search_issues and list_issues tools from the example above, it would look like the following:

| Expected tool / Called tool | search_issues | list_issues |
| --- | --- | --- |
| search_issues | 7 | 3 |
| list_issues | 0 | 10 |

The confusion matrix allows us to see the reason behind low precision and recall for certain tools and tweak their descriptions to minimize confusion.
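As a sketch, the worked example above can be reproduced with scikit-learn, treating the expected tool as the true label and the called tool as the prediction:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

labels = ["search_issues", "list_issues"]

# Expected tool per benchmark: 10 benchmarks for each tool.
y_expected = ["search_issues"] * 10 + ["list_issues"] * 10
# Tool the model actually called: 3 of the search_issues benchmarks
# were answered with list_issues instead.
y_called = ["search_issues"] * 7 + ["list_issues"] * 3 + ["list_issues"] * 10

print(confusion_matrix(y_expected, y_called, labels=labels))
# [[ 7  3]
#  [ 0 10]]
print(precision_score(y_expected, y_called, labels=labels, average=None))
# approximately [1.0, 0.77]: precision(list_issues) = 10 / 13
print(recall_score(y_expected, y_called, labels=labels, average=None))
# approximately [0.7, 1.0]: recall(search_issues) = 7 / 10
```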

Argument correctness

Selecting the right tool isn’t enough. The model must also supply correct arguments. We’ve defined a set of argument-correctness metrics that pinpoint specific issues, making regressions easy to diagnose and fix.

We track four argument-quality metrics:

- Argument hallucination
- All expected arguments provided
- All required arguments provided
- Exact value match

These metrics are computed for tools that were correctly selected. The final report summarizes each tool’s performance across all four metrics.
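One way these checks could look for a single, correctly selected tool call (a sketch based on the metric names; the team’s exact definitions may differ):

```python
def argument_checks(called: dict, expected: dict,
                    required: set, allowed: set) -> dict:
    """Argument-quality checks for one correctly selected tool call.
    called:   arguments the model actually passed (name -> value)
    expected: arguments the benchmark expects (name -> value)
    required: parameter names the tool schema marks as required
    allowed:  every parameter name the tool schema accepts
    """
    return {
        # Did the model avoid inventing arguments the tool does not accept?
        "no_hallucinated_arguments": set(called) <= allowed,
        # Were all arguments the benchmark expects supplied?
        "all_expected_provided": set(expected) <= set(called),
        # Were all schema-required arguments supplied?
        "all_required_provided": required <= set(called),
        # Do the supplied values match the expected values exactly?
        "exact_value_match": all(called.get(k) == v for k, v in expected.items()),
    }
```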

Looking forward and filling the gaps

The current evaluation framework gives us a solid read on tool performance against curated datasets, but there’s still room to improve.

More is better

Benchmark volume is the weak point of offline evaluation. With so many classes (tools), we need more robust per-tool coverage; evaluations based on just a couple of examples per tool aren’t dependable on their own. Adding more benchmarks is always useful for increasing the reliability of the classification evaluation and the other metrics.

Evaluation of multi-tool flows

Our current pipeline handles only single tool calls. In practice, tools are often invoked sequentially, with later calls consuming the outputs of earlier ones. To evaluate these flows, we must go beyond fetching the MCP tool list and actually execute tool calls (or mock their responses) during evaluation.

We’ll also update summarization. Today we treat tool selection as multi-class classification, which assumes one tool per input. For flows where a single input can trigger multiple tools, multi-label classification is the better fit.

Take this with you

Offline evaluation gives us a fast, safe way to iterate on MCP, so models pick the right GitHub tools with the right arguments. By combining curated benchmarks with clear metrics—classification scores for tool selection and targeted checks for argument quality—we turn vague “it seems better” into measurable progress and actionable fixes.

We’re not stopping here. We’re expanding benchmark coverage, refining tool descriptions to reduce confusion, and extending the pipeline to handle real multi-tool flows with execution or faithful mocks. These investments mean fewer regressions, clearer insights, and more reliable agents that help developers move faster.

Most importantly, this work raises the bar for product quality without slowing delivery. As we grow the suite and deepen the evaluation, you can expect steadier improvements to GitHub MCP Server—and a better, more predictable experience for anyone building with it.

The post Measuring what matters: How offline evaluation of GitHub MCP Server works appeared first on The GitHub Blog.
