Salesforce AI Research Blog

AI Agents Improve Their Software Engineering Capabilities Through Self-Reflection

 

Large language model (LLM)-driven software engineering (SWE) agents have made notable progress on code review, bug fixing, and repository-level reasoning. However, they struggle with large, complex software systems, especially when they need to improve their strategy autonomously. This post presents SAGE (Self-Abstraction from Grounded Experience for plan-guided policy refinement), a method that lets an agent learn from its own experience and refine its plan, improving its performance on complex software engineering tasks. Through three stages, exploration and trajectory generation, plan induction, and plan-augmented execution, the agent refines its decision making via self-reflection and distilled experience, without external supervision or retraining, and shows significant gains on the SWE-bench benchmark.

✨ **Challenges for AI agents in software engineering and the motivation for SAGE**: Existing LLM-driven SWE agents perform well on code review and bug fixing, but their understanding of large, complex software systems, such as inter-module dependencies and call hierarchies, is limited, especially when the agent must improve its strategy on its own. SAGE addresses this by letting the agent reflect on its own experience on hard tasks to learn and refine its plan, improving performance on those tasks without additional supervision or retraining.

💡 **The three core stages of SAGE**: SAGE consists of three main stages. First, in exploration and trajectory generation, the agent executes the task in the target environment and records its actions and observations as an experience trajectory. Second, in plan induction, another agent (or the same one) analyzes the trajectory, infers the underlying plan, identifies points of improvement, and produces a refined new plan. Finally, in plan-augmented execution, the induced plan is injected into the original agent's context to guide it toward a better strategy, turning past experience into a guiding "condensed memory".

🚀 **Validation and performance gains on SWE-bench**: To validate SAGE, the researchers ran experiments on the SWE-bench Verified dataset with several models, including GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4/4.5. The results show that SAGE consistently improves agent performance, clearly beating the baselines. For example, in the bash-only setting, SAGE noticeably outperforms the baseline on SWE-bench Verified, demonstrating the potential of self-experience-guided plan refinement for software engineering tasks.

🔍 **LLM as an ensemble judge and a case study**: The work also explores using an LLM as an ensemble judge to combine the outputs of different agents, yielding further gains. In addition, a case study on fixing a binary payload encoding bug (psf__requests-2931) shows in detail how SAGE, through self-reflection, turns the original intuition of "don't decode raw bytes" into a structured improvement plan that is ultimately reflected in the code fix, highlighting the method's ability to produce generalizable, plan-guided repairs.

Large language model (LLM)-based software engineering (SWE) agents have recently demonstrated remarkable progress on realistic software engineering tasks such as code review, bug fixing, and repository-level reasoning. Most SWE agents start from a fresh state for each problem, which limits their understanding of the more complex structure of large software systems, such as inter-module dependencies, call hierarchies, and build configurations. Much of this progress has come from carefully designed prompts and action spaces for LLM agents. However, agents can still struggle on more complex problems, in particular when they commit to a wrong solution path.

This limitation reflects a broader challenge in agentic learning: how to enable agents to refine their policy to better adapt to the environment, without relying on external supervision or retraining. Inspired by recent trends in self-improvement, trajectory distillation, and self-reflective reasoning—as explored in works such as Reflexion [Shinn et al., 2023], Self-Refine [Madaan et al., 2024], LATS [Zhou et al., 2024], and ReAct-style [Yao et al., 2022] trajectory refinement—we propose an approach where the agent learns to induce better plans directly from its own experience on challenging tasks.

Instead of starting with a fresh state for a given problem, the agent is allowed to first explore and follow its own policy. The agent then uses its past experience to induce an abstractive plan, and thereby influence its policy on the problem. Concretely, its initial trajectory through the environment serves both as (1) a record of its evolving understanding and (2) a source of grounded feedback for plan improvement. This process can be viewed as a form of self-imitation learning or training-free on-policy adaptation, where the policy (planning module) improves through reflection rather than gradient-based updates (e.g., tuning model parameters).

Method: Plan-Induction from Self-Experience

Our method, Self-Abstraction from Grounded Experience for plan-guided policy refinement (SAGE), follows a multi-agent framework and can be viewed as having three stages.

Stage 1: Exploration and Trajectory Generation

The goal of this stage is to create grounded experience in the environment that the agent can later use to refine its policy. More specifically, the agent executes the task in the target environment under its own policy and records its actions and observations as an experience trajectory.
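To make this concrete, here is a minimal sketch of the kind of record Stage 1 produces. The `Step`/`Trajectory` containers and the `agent`/`env` interfaces are illustrative assumptions, not the actual mini-swe-agent API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One action/observation pair recorded during exploration."""
    thought: str      # the model's reasoning before acting
    action: str       # e.g. a bash command
    observation: str  # the environment's response, e.g. command output

@dataclass
class Trajectory:
    task: str                                       # the issue / problem statement
    steps: list[Step] = field(default_factory=list)
    final_patch: str = ""                           # the git diff produced by this attempt

def explore(agent, env, task: str, max_steps: int = 50) -> Trajectory:
    """Stage 1: run the agent under its own policy and record everything."""
    traj = Trajectory(task=task)
    obs = env.reset(task)
    for _ in range(max_steps):
        thought, action = agent.act(task, traj.steps, obs)  # policy call
        obs, done = env.step(action)                        # e.g. run the bash command
        traj.steps.append(Step(thought, action, obs))
        if done:
            break
    traj.final_patch = env.current_diff()
    return traj
```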

Stage 2: Plan Induction

In problem solving, choosing the right approach is often critical to the final result. Therefore, in this stage we let another agent infer the high-level plan (approach) behind the experience obtained in Stage 1. In more detail, a plan-induction agent, which may use the same or a different model, reviews the recorded trajectory, infers the plan the agent was implicitly following, identifies what to improve, and produces a refined plan.
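A sketch of how plan induction could be phrased as a single LLM call over the Stage 1 trajectory follows. The prompt wording and the `planner_llm.complete` helper are assumptions for illustration; the exact prompt used in our experiments is not shown here.

```python
PLAN_INDUCTION_PROMPT = """\
You are reviewing another agent's attempt at a software engineering task.

Task:
{task}

Trajectory (thoughts, actions, observations):
{trajectory}

1. Analysis: what plan was the agent implicitly following?
2. Feedback: where did that plan go wrong or fall short?
3. Induced plan: write an improved, step-by-step plan for solving this task.
Return only the induced plan as a numbered list.
"""

def induce_plan(planner_llm, traj) -> str:
    """Stage 2: abstract a better plan from the grounded experience of Stage 1."""
    rendered = "\n".join(
        f"THOUGHT: {s.thought}\nACTION: {s.action}\nOBSERVATION: {s.observation}"
        for s in traj.steps
    )
    prompt = PLAN_INDUCTION_PROMPT.format(task=traj.task, trajectory=rendered)
    return planner_llm.complete(prompt)  # plain text; no gradient updates involved
```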

Stage 3: Plan-Augmented Execution

In this stage, we provide the original agent with the plan induced in Stage 2 so that it can shape a better policy for the given problem. The plan can also be viewed as a condensed memory of the previous attempt.
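A sketch of how the induced plan might be fed back to the original agent, reusing the illustrative `explore` and `induce_plan` helpers from the previous sketches; `set_context` stands in for whatever mechanism the agent framework uses to extend its system prompt.

```python
def plan_augmented_run(agent, env, task: str, induced_plan: str) -> str:
    """Stage 3: rerun the original agent with the induced plan in its context,
    so the previous attempt acts as a condensed, guiding memory."""
    agent.set_context(
        "A previous attempt at this task was analyzed and distilled into the "
        "following plan. Follow it unless the environment clearly contradicts it.\n"
        + induced_plan
    )
    return explore(agent, env, task).final_patch  # same policy loop as Stage 1

# Putting the three stages together (all helper names are illustrative):
# traj  = explore(agent, env, task)                    # Stage 1
# plan  = induce_plan(planner_llm, traj)               # Stage 2
# patch = plan_augmented_run(agent, env, task, plan)   # Stage 3
```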

Conceptually, this corresponds to meta-reasoning through self-play, or “learning to plan” by interpreting one’s own behavior—a lightweight analogue of policy improvement in reinforcement learning, but fully realized through language-based introspection.
Experiments

To validate the proposed approach, we performed experiments on the SWE-bench Verified dataset with various models, including GPT-5-mini, GPT-5, Gemini-2.5-pro, Claude Sonnet 4 and Claude Sonnet 4.5. We built the agent on top of mini-swe-agent [Yang et al., 2024], a simple agent framework with bash-only actions. 

Fig. 1 shows that the self-experience-induced plan consistently improves performance compared to the corresponding baseline.

Prior work on LLM judges [Zheng et al., 2023; Panickssery et al., 2024] shows that LLMs prefer their own generations. Plan induction shares similarities with using LLM-as-a-judge, as it requires reviewing and evaluating trajectories sampled from the policy. When the plan-induction model is the same LLM as the policy model, it is possible that such self-bias leads to a biased assessment and thus suboptimal feedback. To investigate the impact of self-assessment, we perform experiments that vary both the agent backends (i.e., policy models) and the plan-induction models.

| Policy Model | Plan Induction Model | Resolved (%) |
|---|---|---|
| GPT-5 (high) | GPT-5 (high) | 68.6 |
| GPT-5 (high) | Claude Sonnet 4.5 | 66.8 |
| GPT-5 (high) | Gemini-2.5-pro | 68.8 |
| Claude Sonnet 4 | Claude Sonnet 4 | 64.0 |
| Claude Sonnet 4 | GPT-5 (high) | 68.8 |
| Claude Sonnet 4.5 | Claude Sonnet 4.5 | 72.4 |
| Claude Sonnet 4.5 | GPT-5 (high) | 73.2 |

Table 1: Agent performance for different agent backends (i.e., policy models) combined with different models for plan induction.

As shown in Table 1, there is no clear trend indicating that plan-induction bias toward trajectories sampled from the same LLM as the planner leads to a difference in performance. Overall, the Claude models seem to benefit more from plans induced by other models, while GPT-5 benefits from using the same LLM as planner; in all cases the agents still outperform their respective baselines.

LLM Judge As Ensemble

Different agents tend to excel on different subsets of issues, so we adopt a simple ensemble strategy that uses LLM-as-a-judge to capitalize on this diversity. Concretely, we first gather a pool of candidate patches produced by heterogeneous agents (see Table 1). We then prompt an LLM to examine these candidates comparatively, assessing for each one whether it resolves the PR issue and whether it adequately covers potential edge cases. Based on these analyses, the LLM makes a final selection among the candidates. This procedure is lightweight, adding only modest overhead, yet it significantly improves the quality of the chosen patch. Notably, while any single agent may require a large number of LLM calls to generate a git patch, the ensemble judge makes just one call to select among completed candidates. In our experiments, this approach raises performance to 74.6%; see Figure 2.
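A minimal sketch of what such a judge call could look like. The prompt text, the `judge_llm.complete` helper, and the answer-parsing convention are assumptions for illustration, not the exact setup used in our experiments.

```python
import re

JUDGE_PROMPT = """\
You are selecting the best patch for the GitHub issue below.

Issue:
{issue}

Candidate patches (git diffs) produced by different agents:
{candidates}

For each candidate, briefly assess whether it resolves the issue and covers
likely edge cases. End your answer with the number of the best candidate.
"""

def select_patch(judge_llm, issue: str, patches: list[str]) -> str:
    """One extra LLM call that picks among completed candidate patches."""
    rendered = "\n\n".join(
        f"--- Candidate {i + 1} ---\n{p}" for i, p in enumerate(patches)
    )
    reply = judge_llm.complete(JUDGE_PROMPT.format(issue=issue, candidates=rendered))
    chosen = int(re.findall(r"\d+", reply)[-1])  # last integer = chosen index (1-based)
    return patches[chosen - 1]
```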


Evaluation Protocol

We evaluated our agent framework using the official SWE-bench evaluation harness on the 500 instances of the SWE-bench Verified dataset. During evaluation, we identified several gold-patch validation issues in the official dataset that could lead to false negatives. To minimize their impact, we used the versions of those instances from the SWE-bench dataset that reflect recent fixes; for example, astropy__astropy-7606 now removes a non-existent test case. We also addressed remaining gold-patch failures, including compatibility issues in astropy__astropy-{8707,8872} (caused by NumPy ≥1.24 and deprecated package usage), by updating the test_patch field to use modernized setup methods (fix reference). For the psf__requests-{1724,1766,1921,2317} instances, which intermittently failed due to 503 errors from httpbin.org, we use a retry mechanism in RequestsTestCase to ensure stable evaluation and submitted this fix as a PR to the official SWE-bench repository. The resolved rate is calculated as the number of resolved instances divided by the total number of instances (500).
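For illustration, a retry wrapper along the following lines can shield tests from transient httpbin.org outages; this decorator is a hedged sketch, not the exact fix submitted in the PR.

```python
import functools
import time

def retry_on_503(max_attempts: int = 3, delay: float = 2.0):
    """Re-run a test whose failure looks like a transient 503 from httpbin.org."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, max_attempts + 1):
                try:
                    return test_fn(*args, **kwargs)
                except Exception as exc:
                    if "503" not in str(exc):   # only retry transient server errors
                        raise
                    last_exc = exc
                    time.sleep(delay * attempt)  # simple linear backoff
            raise last_exc
        return wrapper
    return decorator

# Illustrative usage inside RequestsTestCase:
#
# @retry_on_503()
# def test_HTTP_200_OK_GET(self):
#     r = requests.get(httpbin("get"))
#     assert r.status_code == 200
```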

Cheating Prevention

Recent findings (link) have exposed a vulnerability in previous SWE-bench Docker images, which allowed agents to access future states of the repository (including commits and PRs), effectively leaking the solution. In our experiments, we observed models potentially exploiting this vulnerability, most commonly via `git log --all` and `git show <commit_id>`. Therefore, to ensure a rigorous and uncontaminated evaluation, we follow the suggestions of the SWE-bench authors and use the recently released, patched Docker images. All experiments are conducted using the latest SWE-bench Verified Docker images (as of September 16, 2025), which avoid this issue by preventing access to future repository history.
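As a sanity check on recorded trajectories, a simple pattern scan of the bash commands is enough to flag the exploit patterns named above. The sketch below is illustrative, and the pattern list is not exhaustive.

```python
import re

# Commands that can reveal future repository state in the unpatched images.
LEAK_PATTERNS = [
    r"\bgit\s+log\b[^\n]*--all",            # browsing all refs, including future commits
    r"\bgit\s+show\s+[0-9a-fA-F]{7,40}\b",  # inspecting a specific (possibly future) commit
]

def flag_leaky_actions(bash_commands: list[str]) -> list[str]:
    """Return the recorded commands that may have accessed future history."""
    return [
        cmd for cmd in bash_commands
        if any(re.search(pattern, cmd) for pattern in LEAK_PATTERNS)
    ]

# Example: flag_leaky_actions(["git log --all --oneline", "ls src/"])
# -> ["git log --all --oneline"]
```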

Case Study: Fixing Binary Payload Encoding via SAGE

In this section, we illustrate a simple case study of how an agent’s self-experienced patch attempt can later serve as the seed for plan induction—without any verifier feedback. 

We structure the walkthrough as follows:

1. Issue description: what went wrong in the original repository.
2. Plan induction: how the agent introspected its own failure to extract a reasoning-level plan.
3. SAGE patch: how that induced plan re-emerged as explicit structure in the final, correct fix.
4. Takeaways: how each analysis, feedback, and new-plan element manifested in code.

Issue description (psf__requests-2931)

Sending a binary payload such as

```python
requests.put(url, data=b"\xc3\xb6\xc3\xb6\xc3\xb6")
```

failed in Requests 2.9, although it worked in 2.8.1. The culprit was `_encode_params`, which always called `to_native_string(data)`, decoding raw bytes as ASCII and triggering a `UnicodeDecodeError`.

Goal: Preserve raw bytes exactly as sent, while still converting text types safely.
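To see why the old path fails, note that `to_native_string` falls back to an ASCII decode on Python 3, which the payload above cannot survive. A minimal reproduction of the failure mode:

```python
# What the old _encode_params effectively did to a binary payload on Python 3:
payload = b"\xc3\xb6\xc3\xb6\xc3\xb6"   # UTF-8 encoding of "ööö"

try:
    payload.decode("ascii")             # to_native_string's default decode
except UnicodeDecodeError as exc:
    print(exc)                          # 'ascii' codec can't decode byte 0xc3 ...

# The fix below therefore has to pass raw bytes through untouched.
```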

Plan induction (from self-experience, no external verifier)

After exploring the environment and experiencing the task through its own trial-and-error attempts, the agent transitions into an introspective phase. Rather than relying on any external signal, it analyzes its own reasoning traces. This introspection crystallizes into three reflective pillars:
(1) Analysis: What was I trying to enforce? What was my underlying plan?
(2) Feedback: What did I overlook or misunderstand during exploration?
(3) Induced Plan: How can I reframe these insights into a stronger next-step strategy?

Through this grounded self-reflection, the agent converts its raw exploration experience into a structured improvement plan, capturing both the intent behind its earlier reasoning and the adjustments needed to refine it:

🟢 Analysis

🟠 Feedback

🔵 Induced Plan

This induced plan distills the original intuition (“don’t decode raw bytes”) into a reproducible reasoning structure that can generalize to similar bugs.

SAGE patch with surfaced plan annotations

Once this reflective plan is induced, SAGE uses it to steer the agent’s next trajectory. Instead of re-entering the environment blindly, the agent now acts under explicit contextual scaffolds derived from its own Analysis, Feedback, and Induced Plan pillars, effectively using its own reasoning history as guidance. This transforms exploration from reactive trial-and-error into purposeful, self-guided repair, where code edits are shaped by previously surfaced insights.

The resulting patch below demonstrates how these induced planning elements materialize in code, with annotations showing which parts of the final fix stem from each reflective pillar.

```diff
@@ -80,8 +80,17 @@ class RequestEncodingMixin(object):
         if parameters are supplied as a dict.
         """
-        if isinstance(data, (str, bytes)):
-            return to_native_string(data)
+        # 🟢 [Analysis] Bytes must remain untouched
+        if isinstance(data, bytes):
+            return data
+        # 🟠 [Feedback] Preserve str → native behavior
+        elif isinstance(data, str):
+            return to_native_string(data)
+        # 🔵 [New Plan] Extend to other binary containers
+        elif isinstance(data, bytearray):
+            return bytes(data)
+        elif isinstance(data, memoryview):
+            try:
+                return data.tobytes()
+            except AttributeError:
+                return data.tostring()
```

where the pillars of the induced plan contributed to the following improvements:

🟢 Analysis → Identified the invariant: “Raw binary data should bypass encoding.”

🟠 Feedback → Clarified type handling, preserving the legacy string path.

🔵 Induced Plan → Generalized the fix to cover all byte-like containers.
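For completeness, an illustrative regression test along these lines would lock the invariant in place; it is a sketch against the patched `_encode_params`, not the test actually shipped with the fix.

```python
import pytest
from requests.models import RequestEncodingMixin

PAYLOAD = b"\xc3\xb6\xc3\xb6\xc3\xb6"

@pytest.mark.parametrize("data", [PAYLOAD, bytearray(PAYLOAD), memoryview(PAYLOAD)])
def test_binary_payloads_bypass_encoding(data):
    # 🟢 Analysis / 🔵 Induced Plan: every byte-like container comes back as
    # the same raw bytes, never decoded.
    assert bytes(RequestEncodingMixin._encode_params(data)) == PAYLOAD

def test_text_payload_keeps_native_string_path():
    # 🟠 Feedback: the legacy str behavior is preserved.
    assert RequestEncodingMixin._encode_params("key=value") == "key=value"
```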

Conclusion

We introduced Plan Induction from Self-Experience, a training-free framework that enables agents to refine their performance by learning directly from their own trajectories in the environment. Through this process, agents extract and induce improved plans that guide more effective subsequent executions. Applied to realistic software engineering (SWE) tasks, specifically the SWE-bench Verified benchmark under the bash-only setting, the method yields consistent and substantial performance gains, underscoring the power of self-grounded experience as a pathway toward agents that grow smarter and self-improve through their own interactions with the world.

Attribution

If you find this work helpful, please consider citing it in your work:

```bibtex
@misc{sage-2025,
  author       = {Hiroaki Hayashi and Bo Pang and Wenting Zhao and Ye Liu and Akash Gokul and Srijan Bansal and Caiming Xiong and Semih Yavuz and Yingbo Zhou},
  title        = {Software Engineering Agent via Self-Abstraction from Grounded Experience},
  year         = {2025},
  month        = {Oct},
  note         = {Blog post, Salesforce AI Research},
  howpublished = {\url{salesforce.com/blog/sfr-sage-swe/}},
}
```

References

Shinn, N., et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
Madaan, A., et al. Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
Yao, S., et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
Zhou, A., et al. Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models. ICML 2024.
Yang, J., et al. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS 2024.
Zheng, L., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
Panickssery, A., et al. LLM Evaluators Recognize and Favor Their Own Generations. NeurIPS 2024.
