MarkTechPost@AI · January 5
This AI Paper Introduces SWE-Gym: A Comprehensive Training Environment for Real-World Software Engineering Agents

SWE-Gym is a new environment developed by researchers from UC Berkeley and other institutions, designed specifically for training software engineering agents. It integrates 2,438 Python tasks drawn from GitHub and provides pre-configured executable environments with expert-validated test cases. By combining real-world task complexity with automated testing, SWE-Gym creates a more effective training ecosystem for language models. The platform addresses the limitations of existing benchmarks and offers a scalable, realistic environment that substantially improves software engineering agent performance, yielding marked gains on the SWE-Bench benchmark.

💪 SWE-Gym integrates 2,438 Python tasks from 11 GitHub repositories, providing pre-configured executable environments and expert-validated test cases that mirror real-world software development.

🛠️ The platform reproduces realistic coding conditions through GitHub issues, repository snapshots, and unit tests, validated with roughly 200 hours of human annotation and 10,000 CPU core hours to ensure the environments are accurate.

🚀 Agents fine-tuned from the Qwen-2.5 Coder model using SWE-Gym achieve markedly higher resolve rates on SWE-Bench: from 20.6% to 32.0% on the Verified split and from 15.3% to 26.0% on the Lite split.

✨ The researchers also trained a verifier on agent trajectories sampled from SWE-Gym to enable inference-time scaling: the agent generates multiple candidate solutions and the verifier selects the best one, reaching a 32.0% Best@K score on SWE-Bench Verified.

Software engineering agents have become essential for managing complex coding tasks, particularly in large repositories. These agents employ advanced language models to interpret natural language descriptions, analyze codebases, and implement modifications. Their applications include debugging, feature development, and optimization. The effectiveness of these systems relies on their ability to handle real-world challenges, such as interacting with extensive repositories and executing tests to validate solutions, making the development of such agents both exciting and challenging.

The lack of comprehensive training environments is one of the primary challenges in this domain. Many existing datasets and benchmarks, such as SWE-Bench and R2E, either focus on isolated problems or rely on synthetic instructions that do not capture the complexities of real-world coding tasks. For instance, while SWE-Bench offers test cases for validation, its training dataset lacks executable environments and dependency configurations. This gap limits the utility of existing benchmarks for training agents capable of addressing the nuanced challenges of software engineering.

Efforts to address these challenges have previously relied on tools like HumanEval and APPS, which support isolated task evaluation but fail to integrate repository-level complexities. These tools often lack a coherent link between natural language problem descriptions, executable codebases, and comprehensive testing frameworks. As a result, there is a pressing need for a platform that can bridge these gaps by offering real-world tasks within functional and executable environments.

Researchers from UC Berkeley, UIUC, CMU, and Apple have developed SWE-Gym, a novel environment tailored for training software engineering agents. SWE-Gym integrates 2,438 Python tasks sourced from GitHub issues across 11 repositories, offering pre-configured executable environments and expert-validated test cases. This platform introduces a groundbreaking approach by combining real-world task complexity with automated testing mechanisms, creating a more effective training ecosystem for language models.
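
To make the setup concrete, the sketch below shows what a single task instance could look like as a data record. The field names are illustrative assumptions based on the description above (issue text, repository snapshot, validated tests, executable image), not SWE-Gym's actual schema.

```python
# A minimal sketch of a SWE-Gym-style task record. Field names are
# illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass, field


@dataclass
class SWEGymTask:
    instance_id: str            # e.g. "<repo>__<issue-number>"
    repo: str                   # one of the 11 source GitHub repositories
    base_commit: str            # repository snapshot the issue was filed against
    problem_statement: str      # natural-language text of the GitHub issue
    image_tag: str              # pre-configured executable environment (e.g. a container image)
    fail_to_pass: list[str] = field(default_factory=list)  # tests a correct fix must make pass
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing
```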

SWE-Gym’s methodology focuses on replicating real-world coding conditions. The tasks are derived from GitHub issues and paired with the corresponding repository snapshots and unit tests. Dependencies for each task are meticulously configured, ensuring the accuracy of the executable environment. These configurations were semi-manually validated through rigorous processes involving around 200 human annotation hours and 10,000 CPU core hours, resulting in a robust training dataset. The researchers also introduced a subset of 230 tasks, SWE-Gym Lite, which targets simpler and self-contained problems, enabling rapid prototyping and evaluation.
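
The paragraph above implies a simple contract for each task: start from the pre-built environment, apply a candidate fix, and run the expert-validated tests. The sketch below illustrates one way such a check could be wired up with a container runtime; the mount path, shell commands, and pytest invocation are assumptions for illustration, not the harness SWE-Gym actually ships.

```python
# A rough sketch of exercising a task's executable environment: apply a
# candidate patch inside the pre-built image and run the validated tests.
import subprocess


def evaluate_patch(task, patch_path: str) -> bool:
    """Return True if the candidate patch resolves the task.

    `task` is a record carrying `image_tag` and `fail_to_pass`, as in the
    earlier sketch; `patch_path` is an absolute host path to a unified diff.
    """
    in_container = (
        "git apply /tmp/fix.patch && "
        "python -m pytest -q " + " ".join(task.fail_to_pass)
    )
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{patch_path}:/tmp/fix.patch:ro",
         task.image_tag, "bash", "-lc", in_container],
        capture_output=True,
    )
    return result.returncode == 0
```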

The performance evaluation of SWE-Gym demonstrated its significant impact on training software engineering agents. Using the Qwen-2.5 Coder model, fine-tuned agents achieved marked improvements in resolving tasks on SWE-Bench benchmarks. Specifically, resolve rates increased from 20.6% to 32.0% on SWE-Bench Verified and from 15.3% to 26.0% on SWE-Bench Lite. These gains represent a significant leap over previous benchmarks for open-weight language models. Furthermore, SWE-Gym-supported agents reduced failure rates in stuck-in-loop scenarios by 18.6% and improved task completion rates in real-world settings.
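
One plausible reading of this training recipe is rejection-sampling fine-tuning: keep the agent trajectories that resolve their tasks and run supervised fine-tuning on them. The sketch below outlines such a loop with standard Hugging Face and PyTorch APIs; the model name, data fields, and hyperparameters are assumptions for illustration rather than the paper's exact configuration.

```python
# A minimal sketch of supervised fine-tuning on successful agent trajectories,
# assuming each trajectory has been flattened into a prompt/response text pair.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed checkpoint; the paper fine-tunes Qwen-2.5 Coder

tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.pad_token or tok.eos_token  # make sure padding is defined
model = AutoModelForCausalLM.from_pretrained(MODEL)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)


def collate(batch):
    # Concatenate prompt and agent response and train on the whole sequence
    # for simplicity; a real setup would mask out the prompt tokens.
    texts = [ex["prompt"] + ex["response"] for ex in batch]
    enc = tok(texts, return_tensors="pt", padding=True,
              truncation=True, max_length=4096)
    enc["labels"] = enc["input_ids"].clone()
    return enc


def finetune(successful_trajectories, epochs: int = 1):
    loader = DataLoader(successful_trajectories, batch_size=1,
                        shuffle=True, collate_fn=collate)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss  # causal-LM cross-entropy
            loss.backward()
            optim.step()
            optim.zero_grad()
```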

The researchers also explored inference-time scaling by employing a verifier trained on agent trajectories sampled from SWE-Gym. This approach allowed agents to generate multiple solution trajectories for a given problem, selecting the most promising one using a reward model. The verifier achieved a Best@K score of 32.0% on SWE-Bench Verified, demonstrating the environment’s capacity for improving agent performance through scalable compute strategies. These results emphasize the potential of SWE-Gym to enhance both the development and evaluation of software engineering agents.
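
In code, this inference-time scaling step reduces to sampling K independent trajectories, scoring each with the verifier, and submitting the highest-scoring one. The sketch below assumes generic `agent_rollout` and `verifier_score` callables; they stand in for whatever interfaces the actual agent and reward model expose.

```python
# A minimal sketch of verifier-guided Best@K selection at inference time.
from typing import Any, Callable


def best_at_k(task: Any,
              agent_rollout: Callable[[Any], Any],
              verifier_score: Callable[[Any, Any], float],
              k: int = 8) -> Any:
    """Sample k trajectories, score each, and return the highest-scoring one."""
    candidates = [agent_rollout(task) for _ in range(k)]            # k independent attempts
    scores = [verifier_score(task, traj) for traj in candidates]    # estimated success likelihood
    best_idx = max(range(k), key=lambda i: scores[i])
    return candidates[best_idx]                                     # trajectory whose patch is submitted
```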

SWE-Gym is a pivotal tool for advancing research on software engineering agents. By addressing the limitations of prior benchmarks and offering a scalable, realistic environment, it equips researchers with the resources needed to develop robust models capable of solving complex software challenges. With its open-source release, SWE-Gym paves the way for significant advances in the field, setting new standards for the training and evaluation of software engineering agents.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Related tags

SWE-Gym · software engineering agents · machine learning · natural language processing