The GitHub Blog · October 24, 04:35
GitHub Copilot: a new model delivers a faster code completion experience

The latest model update to GitHub Copilot significantly improves the efficiency and quality of code completions. By refining model training and evaluation, the new model achieves a 20% increase in accepted and retained characters, a 12% higher acceptance rate, 3x higher throughput, and a 35% reduction in latency. These improvements are meant to let developers spend less time editing and more time building. The article details the three key stages of model evaluation (offline, pre-production, and production) and how multi-dimensional metrics ensure completions are accurate, relevant, and useful. It also shares challenges encountered and lessons learned during training, and looks ahead to future directions.

🚀 **Significant performance gains for a better developer experience**: The new GitHub Copilot iterates on its underlying model to deliver several key improvements, including a 20% increase in accepted and retained characters, a 12% higher acceptance rate, 3x higher throughput, and 35% lower latency. These changes are designed to cut down on editing time so developers can focus on creative work, for a smoother, more efficient coding experience.

🎯 **Multi-dimensional evaluation ensures high-quality completions**: Copilot's model evaluation spans offline benchmarks (both execution-based and LLM-judge scoring), qualitative internal pre-production testing (dogfooding), and A/B testing in production. By combining functional correctness, developer preference, code quality, relevance, and helpfulness, the process ensures suggestions not only compile but also match developers' real needs and habits.

🧠 **Fine-grained training builds a smarter code assistant**: The new model's training pipeline includes code-focused mid-training, supervised fine-tuning, and custom reinforcement learning. For code completion specifically, mixing multiple training objectives (such as span infilling and docstring/function pairs) and optimizing prefix/suffix awareness and format fidelity lets the model generate suggestions that are more accurate, context-aware, and stylistically consistent.

💡 **Learning from practice and iterating continuously**: The article shares lessons learned during model development, such as designing reward signals carefully to avoid reward hacking, relying on multiple metrics rather than over-relying on any single one, and adapting training data to real-world usage. These learnings underpin GitHub Copilot's continued improvement and its goal of an excellent developer experience.

Code completion remains the most widely used GitHub Copilot feature, helping millions of developers stay in the flow every day. Our team has continuously iterated on the custom models powering the completions experience in GitHub Copilot, driven by developer feedback. That work has had a big impact, giving you faster, more relevant suggestions in the editor.

We’re now delivering suggestions with 20% more accepted and retained characters, 12% higher acceptance rate, 3x higher token-per-second throughput, and a 35% reduction in latency. 

These updates now power GitHub Copilot across editors and environments. We’d like to share our journey on how we trained and evaluated our custom model for code completions. 

Why it matters 

When Copilot completions improve, you spend less time editing and more time building. The original Copilot was optimized for the highest acceptance rate possible. However, we realized that a heavy focus on acceptance rates could lead to incorrectly favoring a high volume of simple and short suggestions.  

We heard your feedback that this didn’t reflect real developer needs or deliver the highest quality experience. So, we pivoted to also optimize for accepted and retained characters, code flow, and other metrics. 
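To make the distinction between these metrics concrete, here is a minimal sketch of how acceptance rate and accepted-and-retained characters could be computed. The event records below are hypothetical and do not reflect Copilot's actual telemetry schema.

```python
# Hypothetical completion events: what was shown, whether it was accepted,
# and how many of its characters still survive in the file some time later.
events = [
    {"shown": "return x", "accepted": True, "retained_chars": 8},
    {"shown": "for item in items:\n    process(item)", "accepted": True, "retained_chars": 20},
    {"shown": "pass", "accepted": False, "retained_chars": 0},
]

acceptance_rate = sum(e["accepted"] for e in events) / len(events)

# Accepted-and-retained characters credit longer suggestions that actually
# stay in the code, rather than short ones that are merely easy to accept.
accepted_retained = sum(e["retained_chars"] for e in events if e["accepted"])

print(f"acceptance rate: {acceptance_rate:.0%}")
print(f"accepted and retained characters: {accepted_retained}")
```

A short, trivial suggestion can inflate acceptance rate while contributing almost nothing to retained characters, which is why optimizing on both gives a better picture of real usefulness.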

How we evaluate custom models 

Copilot models are evaluated using combined signals from offline, pre-production, and production evaluations. Each layer helps us refine different aspects of the experience while ensuring better quality in real developer workflows. 

1) Offline evaluations  

Execution-based benchmark: As part of our offline evaluations, we first test against internal and public repositories with strong unit-test and scenario coverage, spanning all major languages. Each test simulates real tasks, accepts suggestions, and measures build-and-test pass rates. This emphasizes functional correctness over surface fluency.
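A minimal sketch of what such an execution-based check can look like, assuming a hypothetical `model.complete(prefix, suffix)` API and a per-case test command (none of these names come from the post):

```python
import pathlib
import subprocess

def run_case(case, model):
    """Insert the model's completion at the cursor position and run the repo's tests."""
    completion = model.complete(case["prefix"], case["suffix"])  # hypothetical API
    patched = case["prefix"] + completion + case["suffix"]
    target = pathlib.Path(case["repo"]) / case["file"]
    original = target.read_text()
    try:
        target.write_text(patched)
        result = subprocess.run(case["test_cmd"], cwd=case["repo"],
                                capture_output=True, timeout=600)
        return result.returncode == 0        # build-and-test pass or fail
    finally:
        target.write_text(original)          # restore the original file

def pass_rate(cases, model):
    """Fraction of benchmark cases whose tests pass with the model's completion."""
    return sum(run_case(c, model) for c in cases) / len(cases)
```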

Below is an example of a partial token completion error: the model produced dataet instead of dataset.
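For illustration only (this snippet is not from the post), a partial-token failure looks roughly like this: the prefix ends mid-identifier and the model completes the wrong token fragment.

```python
prefix = "for row in data"              # cursor sits inside the identifier "dataset"
expected = "set:\n    process(row)"     # correct continuation completes "dataset"
buggy = "et:\n    process(row)"         # wrong fragment yields "dataet"

print(prefix + buggy)                   # for row in dataet:  -> NameError at runtime
```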

LLM-judge scoring: While we start with execution-based evaluation, it has downsides: it only tells us whether the code will compile and run, and the results are not always aligned with developer preferences. To ensure the best possible outcomes, we run an independent LLM to score completions across three axes: code quality, relevance, and helpfulness.
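A minimal sketch of LLM-judge scoring, assuming a hypothetical `judge_llm.score(prompt)` client and an illustrative rubric rather than any specific API or the actual judging prompt:

```python
JUDGE_PROMPT = """You are reviewing a code completion.
Context before the cursor:
{prefix}
Proposed completion:
{completion}
Context after the cursor:
{suffix}

Rate the completion from 1-5 on each axis and answer as JSON:
{{"quality": ..., "relevance": ..., "helpfulness": ...}}"""

def judge(case, completion, judge_llm):
    """Ask an independent LLM to grade a completion along three axes."""
    prompt = JUDGE_PROMPT.format(prefix=case["prefix"],
                                 completion=completion,
                                 suffix=case["suffix"])
    return judge_llm.score(prompt)  # hypothetical client returning parsed JSON scores
```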

2) Pre-production evaluations: Qualitative dogfooding 

Our next step includes working with internal developers and partners to test models side-by-side in real workflows (to do the latter, we exposed the preview model to developers through Copilot’s model picker). We collect structured feedback on readability, trust, and “taste.” Part of this process includes working with language experts to improve overall completion quality. This is unique: while execution-based testing, LLM-based evaluations, dogfood testing, and A/B testing are common, we find language-specific evaluations lead to better outcomes along quality and style preferences. 

3) Production-based evaluations: A/B testing 

Ultimately, the lived experience of developers like you is what matters most. We measure improvements using accepted-and-retained characters, acceptance rates, completion-shown rate, time-to-first token, latency, and many other metrics. We ship only when statistically significant improvements hold up under real developer workloads. 
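As a sketch of the "ship only on statistically significant wins" gate, here is a two-proportion z-test on acceptance rate. The counts and the 1.96 threshold are illustrative assumptions, not GitHub's actual experiment framework, which gates on many metrics at once.

```python
import math

def two_proportion_z(accepted_a, shown_a, accepted_b, shown_b):
    """z-statistic for the difference in acceptance rate between control (a) and treatment (b)."""
    p_a, p_b = accepted_a / shown_a, accepted_b / shown_b
    p_pool = (accepted_a + accepted_b) / (shown_a + shown_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / shown_a + 1 / shown_b))
    return (p_b - p_a) / se

# Illustrative example counts, not real experiment data.
z = two_proportion_z(accepted_a=41_200, shown_a=150_000,
                     accepted_b=46_350, shown_b=150_000)
ship = z > 1.96   # roughly 95% confidence on this one metric
print(f"z = {z:.2f}, ship candidate: {ship}")
```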

How we trained our new Copilot completions model 

Mid-training 

Modern codebases use modern APIs. Before fine-tuning, we build a code-specific foundational model via mid-training using a curated, de-duplicated corpus of modern, idiomatic, public, and internal code with nearly 10M repositories and 600-plus programming languages. (Mid-training refers to the stage after the base model has been pretrained on a very large, diverse corpus, but before it undergoes final fine-tuning or instruction-tuning). 

This is a critical step to ensure new behaviors, new language syntax, and recent API versions are utilized by the model. During mid-training we also mix objectives beyond next-token prediction, such as span infilling and docstring/function pairs, so the model learns structure, naming, and intent. This makes the foundational model code-fluent, style-consistent, and context-aware, and ready for the more targeted supervised fine-tuning and reinforcement learning described below.
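As a rough sketch of the span-infilling objective, the snippet below turns a source file into a fill-in-the-middle training example. The sentinel tokens and the random span-selection policy are illustrative placeholders, not the actual training format.

```python
import random

def make_fim_example(source_code, rng=random.Random(0)):
    """Turn one source file into a fill-in-the-middle training example."""
    # Pick a span to hide; real pipelines bias spans toward syntactically
    # meaningful units such as expressions or function bodies.
    i = rng.randrange(0, len(source_code) - 1)
    j = rng.randrange(i + 1, len(source_code))
    prefix, middle, suffix = source_code[:i], source_code[i:j], source_code[j:]

    # The prompt asks the model to produce the hidden middle given prefix and suffix.
    prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"
    return {"prompt": prompt, "target": middle}

example = make_fim_example("def area(r):\n    return 3.14159 * r * r\n")
print(example["prompt"])
print("target:", repr(example["target"]))
```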

Supervised fine-tuning 

Newer general-purpose chat models perform well at generating code from natural language prompts, but underperform on fill-in-the-middle (FIM) code completion. In practice, chat models produce cursor-misaligned inserts, duplicate the code before the cursor (the prefix), and overwrite the code after the cursor (the suffix).

As we moved to fine-tuned behaviors, we trained models specialized in completions, using synthetic fine-tuning data so they behave like a great FIM engine. In practice, this improves cursor alignment, prefix/suffix awareness, and format fidelity.
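To make those failure modes concrete, here is a small, hypothetical check (not Copilot's client logic) that flags completions which re-emit the tail of the prefix or clobber the suffix:

```python
def flags_fim_issues(prefix, completion, suffix, window=30):
    """Heuristic checks for common chat-model FIM failures."""
    issues = []
    tail = prefix[-window:]
    if tail and completion.startswith(tail):
        issues.append("duplicates code before the cursor")
    head = suffix.lstrip()[:window]
    if head and head in completion:
        issues.append("rewrites code after the cursor")
    return issues

# The completion below re-emits "values)" even though it already exists in the suffix.
print(flags_fim_issues(prefix="total = sum(",
                       completion="sum(values)\nprint(total)",
                       suffix="values)\n"))
```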

The result is significantly improved FIM performance. For example, here is a benchmark comparing our latest completions model to GPT-4.1-mini on OpenAI’s HumanEval Infilling Benchmarks.  

Reinforcement learning 

Finally, we used a custom reinforcement learning algorithm, teaching the model through rewards and penalties to internalize what makes code suggestions useful in real developer scenarios along three axes: correctness, relevance, and usefulness.

Together, these create completions that are correct, relevant, and genuinely useful at the cursor instead of being verbose or superficially helpful. 
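A highly simplified sketch of reward shaping along these axes follows. The weights, score mappings, and helper names here are illustrative assumptions, not the actual reward model.

```python
def completion_reward(passes_tests, judge_scores, completion, reference_len):
    """Combine correctness, relevance, and usefulness signals into one scalar reward."""
    correctness = 1.0 if passes_tests else -1.0
    relevance = (judge_scores["relevance"] - 3) / 2      # map a 1-5 judge score to [-1, 1]
    usefulness = (judge_scores["helpfulness"] - 3) / 2

    # Penalize verbosity so the policy is not rewarded for padding suggestions.
    length_penalty = max(0.0, len(completion) / max(reference_len, 1) - 1.5) * 0.2

    return 0.5 * correctness + 0.3 * relevance + 0.2 * usefulness - length_penalty

r = completion_reward(passes_tests=True,
                      judge_scores={"relevance": 4, "helpfulness": 5},
                      completion="return sorted(items, key=len)",
                      reference_len=30)
print(f"reward = {r:.2f}")
```

Reward design like this is exactly where reward hacking can creep in, which is one of the lessons discussed below.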

What we learned 

After talking with programming language experts and finding success in our prompt-based approach, one of our most important lessons was adding related files, like C++ header files, to our training data. Beyond this, we also came away with three key learnings: design reward signals carefully to avoid reward hacking, rely on a balanced set of metrics rather than any single one, and shape training data around real-world usage scenarios.

What’s next 

We’re continuing to push the frontier of Copilot completions by: 

Experience faster, smarter code completions yourself. Try GitHub Copilot in VS Code > 

Acknowledgments 

First, a big shoutout to our developer community for continuing to give us feedback and push us to deliver the best possible experiences with GitHub Copilot. Moreover, a huge thanks to the researchers, engineers, product managers, and designers across GitHub and Microsoft who curated the training data and built the training pipeline, evaluation suites, and client and serving stack, and to the GitHub Copilot product and engineering teams for smooth model releases.

The post The road to better completions: Building a faster, smarter GitHub Copilot with a new custom model appeared first on The GitHub Blog.
