Anthropic Evaluates the Potential Misaligned-Sabotage Risk of Claude Opus

Anthropic has released a report on the potential misaligned-sabotage risk of Claude Opus 4 and 4.1. The report concludes that the risk of the model taking autonomous actions contrary to human intent that contribute significantly to catastrophic outcomes, which it abbreviates as "sabotage risk", is "very low, but not completely negligible". The report also underwent two independent reviews, one by METR and one by an internal Anthropic team. It analyzes in detail nine risk pathways through which sabotage could arise via internal or external use, and assesses the mitigating factors for each. Anthropic advances three general claims: that Opus 4 is unreliable at complex tasks requiring long-horizon autonomous action, that it struggles to effectively hide its reasoning, and that it is unlikely to hold stable, dangerous goals or values. The report supports these claims with evidence from internal-use monitoring, capability evaluations, and propensity evaluations, and emphasizes the importance of model developers conducting this kind of safety assessment.

⚠️ Anthropic has published a report assessing the potential "sabotage risk" of its Claude Opus models, that is, the risk that a model autonomously takes actions contrary to human intent that ultimately lead to catastrophic outcomes; the assessment finds this risk is currently "very low, but not completely negligible".

🛡️ The report analyzes nine potential risk pathways, including sandbagging of safety-relevant R&D, planting code backdoors, poisoning the training data of future models, leaking critical information, self-exfiltration and autonomous operation, and sabotage within other high-resource AI developers and major government bodies.

🔬 Anthropic puts forward three main claims that weigh against these sabotage risks: first, that Opus 4 is unreliable at complex tasks requiring long stretches of autonomous action; second, that even when prompted to do so, Opus 4 cannot reliably hide its reasoning; and third, that Opus 4 is unlikely to form dangerous goals or values that are stable across interactions. These claims are backed by internal-use monitoring and multiple capability evaluations.

🤝 The report underwent independent reviews by the external organization METR and an internal Anthropic team; METR verified a number of the report's claims, including by interviewing Anthropic employees. The report also notes internal communication adjustments and practice improvements made as a result of the external review, illustrating that the safety-review mechanism works in practice.

🔒 Anthropic redacted some of the evidence supporting its conclusions in the published report but shared the complete information with METR, preserving the effectiveness of the external review. This redaction mechanism offers model risk reports a useful precedent for balancing the protection of sensitive information against transparency.

Published on October 30, 2025 8:19 PM GMT

Anthropic released a report on the misalignment sabotage risks of Claude Opus 4 and 4.1. They concluded "there is a very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes, abbreviated as sabotage risk". They also had two independent reviews of their risk report, one conducted by METR and one conducted by an internal team.

I don't think that conclusion is very surprising; I'm not concerned about today's Claude models posing a catastrophic risk. But I still think this report is a big deal as a matter of precedent. This is the first time a model developer has made something like a safety case for a model. The report is very thorough, as was the review. I think the fact that Anthropic did this is very cool and I hope they and other model developers do more of it.

The report itself is long and dry, so I've summarized it below and added my takeaways. I think it would be great if the broader AI safety community got in the habit of reading reports like this and noting if they agree or disagree with the argumentation and conclusions; I very much want to hold AI developers to high standards for arguing their models are safe, and I want to engage with their arguments.


Risk pathways addressed

This section is copied from METR's review of Anthropic's report.

The Sabotage Risk Report focuses on sabotage risk, which Anthropic defines as risks of “misaligned autonomous actions that contribute significantly to later catastrophic outcomes”. The report covers nine specific risk pathways for sabotage through internal or external use, which are “intermediate unwanted outcomes in which sabotage by an AI system [sets] up the conditions for a later catastrophic outcome”. These nine risk pathways are:

1. Broad/diffuse sandbagging on safety-relevant R&D
2. Targeted Sabotage of Safety-Relevant R&D
3. Code Backdoors to Help Future Misaligned Models
4. Poisoning the Training Data of Future Models
5. Leaks that Sabotage the AI Developer [As in, leak crucial Anthropic IP. Since Anthropic believes they are decreasing catastrophic risks, they count hurting Anthropic as contributing to catastrophic outcomes.]
6. Self-exfiltration and Autonomous Operation
7. Persistent Rogue Internal Deployment
8. R&D Sabotage within Other High-Resource AI Developers
9. Decision Sabotage within Major Governments

Much of the report focuses on considering in detail whether Opus 4 (and similar future models) could increase sabotage risk through any of these risk pathways. The report names mitigating factors for each risk pathway and estimates the strength of each mitigating factor. Anthropic also lays out the following three general claims that weigh against sabotage risks across the board:

Claim 1: Opus 4 is unreliable at highly-complex tasks, especially long-horizon agentic tasks requiring more than tens of minutes of autonomous action.

Claim 2: Opus 4, when taking on complex tasks, is not capable of reliably hiding its reasoning or otherwise covering its tracks, even when prompted to do so.

Claim 3: Opus 4 is unlikely to have dangerous goals or values that are stable across interactions.

Evidence Anthropic uses to support these claims includes internal-use monitoring, capability evaluations, and propensity evaluations.

Other notable aspects of the report include the independent METR review and the redaction of some supporting evidence in the public version, with the full material shared with METR.

1. ^ This isn't a rock-solid argument, especially once AIs become capable of automating AI R&D.



