https://simonwillison.net/atom/everything 10月23日 04:47
OpenAI CISO 详解 ChatGPT Atlas 的提示注入风险应对措施
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

OpenAI 的首席信息安全官 Dane Stuckey 近期发布推文,详细阐述了 ChatGPT Atlas 浏览器在面对提示注入攻击时的风险与应对策略。他解释了提示注入攻击的原理,即攻击者试图通过隐藏恶意指令来诱导 AI 代理执行非预期操作,其目的可能从操纵购物意见到窃取敏感数据。Stuckey 强调了 OpenAI 在此问题上的深入研究和多层防御措施,包括红队测试、创新的模型训练技术、安全防护措施以及检测与阻止攻击的新系统。他承认提示注入仍是安全领域的挑战,并介绍了“注销模式”和“观察模式”等用户保护功能,旨在平衡 AI 代理的便捷性与安全性。

🛡️ **提示注入风险与 OpenAI 的应对策略:** OpenAI 的首席信息安全官 Dane Stuckey 明确指出了提示注入攻击的风险,即攻击者试图通过隐藏在网站或邮件中的恶意指令,诱导 AI 代理执行非预期操作,例如操纵购物偏好或窃取私密数据。OpenAI 正在深入研究并采取多项措施来缓解这些风险,包括广泛的红队测试、创新的模型训练技术以奖励模型忽略恶意指令,以及部署重叠的防护栏和安全措施。

🤝 **建立信任与 AI 的局限性:** Stuckey 将 ChatGPT 代理的目标用户信任度比作“最能干、最值得信赖且最懂安全的朋友或同事”。然而,文章指出 AI 系统与人类不同,无法对其行为承担责任,这种信任的建立依赖于社会性后果,而 AI 缺乏这一点。这突显了 AI 安全防护的重要性,因为即使有防护措施,仍需警惕潜在的漏洞。

💡 **多层防御与持续改进:** OpenAI 采取了“纵深防御”策略,包括优先考虑快速响应系统以识别和阻止攻击活动,并持续投入安全、隐私和安全研究,以提高模型鲁棒性、加强安全监控和基础设施安全控制。尽管承认提示注入是“一个前沿的、未解决的安全问题”,OpenAI 致力于通过这些措施来防止攻击,并认识到对手将持续投入资源寻找突破口。

🔒 **用户控制与模式设计:** 为保护用户,ChatGPT Atlas 提供了“注销模式”,允许代理在不访问用户凭证的情况下执行操作,适用于不需要在账户内进行操作的场景。对于需要更高级操作的场景,“登录模式”则建议在风险较低且信任度高的网站上使用。此外,“观察模式”会在代理操作敏感网站时发出警报,并要求用户保持标签页活跃,以确保用户了解并控制代理的行为。

🌐 **安全演进与用户意识:** 文章将 AI 安全的演进类比于早期计算机病毒的挑战,强调了技术、社会和风险缓解策略需要共同发展。Stuckey 呼吁用户理解负责任的使用方式,包括警惕提示注入攻击,以便共同安全地受益于这项技术。然而,作者也指出,普通用户在防范病毒方面的历史表现并不理想,这表明提升用户安全意识仍然是一个持续的挑战。

Dane Stuckey (OpenAI CISO) on prompt injection risks for ChatGPT Atlas

22nd October 2025

My biggest complaint about the launch of the ChatGPT Atlas browser the other day was the lack of details on how OpenAI are addressing prompt injection attacks. The launch post mostly punted that question to the System Card for their “ChatGPT agent” browser automation feature from July. Since this was my single biggest question about Atlas I was disappointed not to see it addressed more directly.

OpenAI’s Chief Information Security Officer Dane Stuckey just posted the most detail I’ve seen yet in a lengthy Twitter post.

I’ll quote from his post here (with my emphasis in bold) and add my own commentary.

He addresses the issue directly by name, with a good single-sentence explanation of the problem:

One emerging risk we are very thoughtfully researching and mitigating is prompt injections, where attackers hide malicious instructions in websites, emails, or other sources, to try to trick the agent into behaving in unintended ways. The objective for attackers can be as simple as trying to bias the agent’s opinion while shopping, or as consequential as an attacker trying to get the agent to fetch and leak private data, such as sensitive information from your email, or credentials.

We saw examples of browser agents from other vendors leaking private data in this way identified by the Brave security team just yesterday.

Our long-term goal is that you should be able to trust ChatGPT agent to use your browser, the same way you’d trust your most competent, trustworthy, and security-aware colleague or friend.

This is an interesting way to frame the eventual goal, describing an extraordinary level of trust and competence.

As always, a big difference between AI systems and a human is that an AI system cannot be held accountable for its actions. I’ll let my trusted friend use my logged-in browser only because there are social consequences if they abuse that trust!

We’re working hard to achieve that. For this launch, we’ve performed extensive red-teaming, implemented novel model training techniques to reward the model for ignoring malicious instructions, implemented overlapping guardrails and safety measures, and added new systems to detect and block such attacks. However, prompt injection remains a frontier, unsolved security problem, and our adversaries will spend significant time and resources to find ways to make ChatGPT agent fall for these attacks.

I’m glad to see OpenAI’s CISO openly acknowledging that prompt injection remains an unsolved security problem (three years after we started talking about it!).

That “adversaries will spend significant time and resources” thing is the root of why I don’t see guardrails and safety measures as providing a credible solution to this problem.

As I’ve written before, in application security 99% is a failing grade. If there’s a way to get past the guardrails, no matter how obscure, a motivated adversarial attacker is going to figure that out.

Dane goes on to describe some of those measures:

To protect our users, and to help improve our models against these attacks:

    We’ve prioritized rapid response systems to help us quickly identify block attack campaigns as we become aware of them.

I like this a lot. OpenAI have an advantage here of being a centralized system—they can monitor their entire user base for signs of new attack patterns.

It’s still bad news for users that get caught out by a zero-day prompt injection, but it does at least mean that successful new attack patterns should have a small window of opportunity.

    We are also continuing to invest heavily in security, privacy, and safety—including research to improve the robustness of our models, security monitors, infrastructure security controls, and other techniques to help prevent these attacks via defense in depth.

“Defense in depth” always sounds good, but it worries me that it’s setting up a false sense of security here. If it’s harder but still possible someone is going to get through.

    We’ve designed Atlas to give you controls to help protect yourself. We have added a feature to allow ChatGPT agent to take action on your behalf, but without access to your credentials called “logged out mode”. We recommend this mode when you don’t need to take action within your accounts. Today, we think “logged in mode” is most appropriate for well-scoped actions on very trusted sites, where the risks of prompt injection are lower. Asking it to add ingredients to a shopping cart is generally safer than a broad or vague request like “review my emails and take whatever actions are needed.”

Logged out mode is very smart, and is already a tried and tested pattern. I frequently have Claude Code or Codex CLI fire up Playwright to interact with websites, safe in the knowledge that they won’t have access to my logged-in sessions. ChatGPT’s “agent” mode provided a similar capability.

Logged in mode is where things get scary, especially since we’re delegating security decisions to end-users of the software. We’ve demonstrated many times over that this is an unfair burden to place on almost any user.

    When agent is operating on sensitive sites, we have also implemented a “Watch Mode” that alerts you to the sensitive nature of the site and requires you have the tab active to watch the agent do its work. Agent will pause if you move away from the tab with sensitive information. This ensures you stay aware—and in control—of what agent actions the agent is performing. [...]

This detail is new to me: I need to spend more time with ChatGPT Atlas to see what it looks like in practice.

I tried just now using both GitHub and an online banking site and neither of them seemed to trigger “watch mode”—Atlas continued to navigate even when I had switched to another application.

This sounds reasonable in theory—similar to a driver-assisted car that requires you to keep your hands on the wheel—but I’d like to see it in action before I count it as a meaningful mitigation.

Dane closes with an analogy to computer viruses:

New levels of intelligence and capability require the technology, society, the risk mitigation strategy to co-evolve. And as with computer viruses in the early 2000s, we think it’s important for everyone to understand responsible usage, including thinking about prompt injection attacks, so we can all learn to benefit from this technology safely.

I don’t think the average computer user ever really got the hang of staying clear of computer viruses... we’re still fighting that battle today, albeit much more successfully on mobile platforms that implement tight restrictions on what software can do.

My takeaways from all of this? It’s not done much to influence my overall skepticism of the entire category of browser agents, but it does at least demonstrate that OpenAI are keenly aware of the problems and are investing serious effort in finding the right mix of protections.

How well those protections work is something I expect will become clear over the next few months.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

ChatGPT Atlas OpenAI 提示注入 Prompt Injection AI安全 网络安全 CISO Dane Stuckey AI Agent
相关文章