New AI agent security rules, and the attacker moves second


Two recent papers on large language model (LLM) security and prompt injection have drawn attention. Meta AI’s “Agents Rule of Two” aims to avoid the highest-impact consequences of prompt injection by limiting an agent, within a single session, to no more than two of three properties: processing untrustworthy inputs, accessing sensitive systems or private data, and changing state or communicating externally. The second paper, “The Attacker Moves Second”, reports that twelve published defenses against LLM jailbreaks and prompt injection all perform poorly against “adaptive attacks”, which succeed at very high rates, exposing shortcomings in how LLM security defenses are evaluated. The article argues that until reliable defenses exist, Meta’s “Agents Rule of Two” is practical advice for building secure LLM agent systems.

🛡️ **A new framework for AI agent security: the “Agents Rule of Two”** Meta AI’s “Agents Rule of Two” offers a practical framework for AI agent security. It recommends that, until reliable prompt injection defenses mature, an AI agent should hold at most two of three high-risk properties within a single session: processing untrustworthy inputs, accessing sensitive systems or private data, and changing state or communicating externally. If all three are genuinely required, the agent should not operate autonomously and instead needs human approval or another reliable means of validation, minimizing the potential damage from prompt injection attacks.

📈 **Prompt injection defenses fail: The Attacker Moves Second** A paper titled “The Attacker Moves Second”, from researchers at several prominent institutions, reports that twelve published defenses against LLM jailbreaks and prompt injection failed to hold up against “adaptive attacks”. Through iterative optimization, these attacks bypassed most existing defenses with success rates above 90%, and reached 100% in the human red-teaming setting. This shows that static attack examples are not enough to evaluate a defense’s effectiveness; more challenging adaptive evaluation methods are needed.

💡 **Existing defense evaluations fall short; the bar must be raised** The paper stresses that static, one-off attack examples do not reflect the true robustness of LLM defenses. The researchers found that automated adaptive attack techniques based on gradient descent, reinforcement learning, and search can systematically bypass existing defenses. Raising the standard for defense evaluation is therefore critical, and reliable defenses that can withstand these stronger, more adaptive attacks are urgently needed to keep LLM-powered agent systems secure.

⚖️ **Practical advice: prioritize risk control until defenses mature** Given the severity of prompt injection attacks and the limitations of existing defenses, Meta’s “Agents Rule of Two” is regarded as the best current practice for building secure LLM agent systems. By grading risk and restricting combinations of high-risk properties, the rule gives developers a clear guideline that can meaningfully reduce the risks of data exfiltration, state tampering, and other consequences of prompt injection in the short term, until stronger defensive techniques are developed and validated.

New prompt injection papers: Agents Rule of Two and The Attacker Moves Second

2nd November 2025

Two interesting new papers regarding LLM security and prompt injection came to my attention this weekend.

Agents Rule of Two: A Practical Approach to AI Agent Security

The first is Agents Rule of Two: A Practical Approach to AI Agent Security, published on October 31st on the Meta AI blog. It doesn’t list authors but it was shared on Twitter by Meta AI security researcher Mick Ayzenberg.

It proposes a “Rule of Two” that’s inspired by both my own lethal trifecta concept and the Google Chrome team’s Rule Of 2 for writing code that works with untrustworthy inputs:

At a high level, the Agents Rule of Two states that until robustness research allows us to reliably detect and refuse prompt injection, agents must satisfy no more than two of the following three properties within a session to avoid the highest impact consequences of prompt injection.

[A] An agent can process untrustworthy inputs

[B] An agent can have access to sensitive systems or private data

[C] An agent can change state or communicate externally

It’s still possible that all three properties are necessary to carry out a request. If an agent requires all three without starting a new session (i.e., with a fresh context window), then the agent should not be permitted to operate autonomously and at a minimum requires supervision --- via human-in-the-loop approval or another reliable means of validation.

It’s accompanied by this handy diagram:

I like this a lot.

I’ve spent several years now trying to find clear ways to explain the risks of prompt injection attacks to developers who are building on top of LLMs. It’s frustratingly difficult.

I’ve had the most success with the lethal trifecta, which boils one particular class of prompt injection attack down to a simple-enough model: if your system has access to private data, exposure to untrusted content and a way to communicate externally then it’s vulnerable to private data being stolen.

The one problem with the lethal trifecta is that it only covers the risk of data exfiltration: there are plenty of other, even nastier risks that arise from prompt injection attacks against LLM-powered agents with access to tools which the lethal trifecta doesn’t cover.

The Agents Rule of Two neatly solves this, through the addition of “changing state” as a property to consider. This brings other forms of tool usage into the picture: anything that can change state triggered by untrustworthy inputs is something to be very cautious about.
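To make the three-property check concrete, here is a minimal Python sketch of how a Rule of Two gate might look inside an agent framework. The `Capability` flags and the `rule_of_two_allows_autonomy` helper are names invented for this illustration, not anything defined in Meta’s post:

```python
from enum import Flag, auto

class Capability(Flag):
    # The three Rule of Two properties (names are illustrative)
    UNTRUSTWORTHY_INPUTS = auto()  # [A] processes untrustworthy inputs
    SENSITIVE_ACCESS = auto()      # [B] can access sensitive systems or private data
    STATE_OR_COMMS = auto()        # [C] can change state or communicate externally

ALL_PROPERTIES = (
    Capability.UNTRUSTWORTHY_INPUTS,
    Capability.SENSITIVE_ACCESS,
    Capability.STATE_OR_COMMS,
)

def rule_of_two_allows_autonomy(session_capabilities: Capability) -> bool:
    """True if the session holds at most two of the three properties."""
    return sum(1 for cap in ALL_PROPERTIES if cap in session_capabilities) <= 2

# A session that reads untrusted web content [A] and can send email [C],
# but has no access to private data, stays within the rule.
web_and_email = Capability.UNTRUSTWORTHY_INPUTS | Capability.STATE_OR_COMMS
assert rule_of_two_allows_autonomy(web_and_email)

# Add access to private data [B] and the agent should no longer run
# autonomously: it needs human-in-the-loop approval or equivalent validation.
assert not rule_of_two_allows_autonomy(web_and_email | Capability.SENSITIVE_ACCESS)
```

The useful part of framing it this way is that the check applies per session: dropping any one of the three capabilities, or routing the action through human approval, is enough to satisfy the rule.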

It’s also refreshing to see another major research lab concluding that prompt injection remains an unsolved problem, and attempts to block or filter them have not proven reliable enough to depend on. The current solution is to design systems with this in mind, and the Rule of Two is a solid way to think about that.

Which brings me to the second paper...

The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections

This paper is dated 10th October 2025 on Arxiv and comes from a heavy-hitting team of 14 authors—Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander V. Schulhoff, Jamie Hayes, Michael Ilie, Juliette Pluto, Shuang Song, Harsh Chaudhari, Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Andreas Terzis, Florian Tramèr—including representatives from OpenAI, Anthropic, and Google DeepMind.

The paper looks at 12 published defenses against prompt injection and jailbreaking and subjects them to a range of “adaptive attacks”—attacks that are allowed to expend considerable effort iterating multiple times to try and find a way through.

The defenses did not fare well:

By systematically tuning and scaling general optimization techniques—gradient descent, reinforcement learning, random search, and human-guided exploration—we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates.

Notably the “Human red-teaming setting” scored 100%, defeating all defenses. That red-team consisted of 500 participants in an online competition they ran with a $20,000 prize fund.

The key point of the paper is that static example attacks—single string prompts designed to bypass systems—are an almost useless way to evaluate these defenses. Adaptive attacks are far more powerful, as shown by this chart:

The three automated adaptive attack techniques used by the paper are:

1. Gradient-based methods—these were the least effective, using the technique described in the legendary Universal and Transferable Adversarial Attacks on Aligned Language Models paper from 2023.
2. Reinforcement learning methods—particularly effective against black-box models: “we allowed the attacker model to interact directly with the defended system and observe its outputs”, using 32 sessions of 5 rounds each.
3. Search-based methods—generate candidates with an LLM, then evaluate and further modify them using LLM-as-judge and other classifiers.
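To give a rough sense of how a search-based adaptive attack operates, here is a hypothetical sketch of that kind of loop. The `generate_candidates`, `target_system` and `judge_score` callables are placeholders for an attacker LLM, the defended system and an LLM-as-judge classifier; nothing here reproduces the paper’s actual implementation:

```python
def adaptive_search_attack(
    seed_prompts: list[str],
    generate_candidates,   # attacker LLM: prompt -> list of mutated prompts (placeholder)
    target_system,         # defended system under test: prompt -> response (placeholder)
    judge_score,           # LLM-as-judge / classifier: (prompt, response) -> score in [0, 1]
    iterations: int = 20,
    beam_width: int = 8,
) -> tuple[str, float]:
    """Beam-search sketch of a search-based adaptive attack (illustrative only)."""
    pool = list(seed_prompts)
    best_prompt, best_score = "", 0.0

    for _ in range(iterations):
        # Ask an LLM to propose variations of every prompt currently in the pool.
        candidates = [c for p in pool for c in generate_candidates(p)]

        scored = []
        for prompt in candidates:
            response = target_system(prompt)       # query the defended system
            score = judge_score(prompt, response)  # grade how successful the attack was
            scored.append((score, prompt))
            if score > best_score:
                best_prompt, best_score = prompt, score

        if best_score >= 0.99:  # the judge considers the attack successful
            break

        # Keep only the most promising candidates for the next round.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        pool = [prompt for _, prompt in scored[:beam_width]]

    return best_prompt, best_score
```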

The paper concludes somewhat optimistically:

[...] Adaptive evaluations are therefore more challenging to perform, making it all the more important that they are performed. We again urge defense authors to release simple, easy-to-prompt defenses that are amenable to human analysis. [...] Finally, we hope that our analysis here will increase the standard for defense evaluations, and in so doing, increase the likelihood that reliable jailbreak and prompt injection defenses will be developed.

Given how totally the defenses were defeated, I do not share their optimism that reliable defenses will be developed any time soon.

As a review of how far we still have to go, this paper packs a powerful punch. I think it makes a strong case for Meta’s Agents Rule of Two as the best practical advice for building secure LLM-powered agent systems today in the absence of prompt injection defenses we can rely on.
