Published on October 30, 2025 8:19 PM GMT
Anthropic released a report on the misalignment sabotage risks of Claude Opus 4 and 4.1. They concluded "there is a very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes, abbreviated as sabotage risk". They also had two independent reviews of their risk report, one conducted by METR and one conducted by an internal team.
I don't think that conclusion is very surprising; I'm not concerned about today's Claude models posing a catastrophic risk. But I still think this report is a big deal as a matter of precedent. This is the first time a model developer has made something like a safety case for a model. The report is very thorough, as was the review. I think the fact that Anthropic did this is very cool and I hope they and other model developers do more of it.
The report itself is long and dry, so I've summarized it below and added my takeaways. I think it would be great if the broader AI safety community got in the habit of reading reports like this and noting if they agree or disagree with the argumentation and conclusions; I very much want to hold AI developers to high standards for arguing their models are safe, and I want to engage with their arguments.
Risk pathways addressed
This section is copied from METR's review of Anthropic's report.
The Sabotage Risk Report focuses on sabotage risk, which Anthropic defines as risks of “misaligned autonomous actions that contribute significantly to later catastrophic outcomes”. The report covers nine specific risk pathways for sabotage through internal or external use, which are “intermediate unwanted outcomes in which sabotage by an AI system [sets] up the conditions for a later catastrophic outcome”. These nine risk pathways are:
- Broad/diffuse sandbagging on safety-relevant R&DTargeted Sabotage of Safety-Relevant R&DCode Backdoors to Help Future Misaligned ModelsPoisoning the Training Data of Future ModelsLeaks that Sabotage the AI Developer [As in, leak crucial Anthropic IP. Since Anthropic believes they are decreasing catastrophic risks, they count hurting Anthropic as contributing to catastrophic outcomes.]Self-exfiltration and Autonomous OperationPersistent Rogue Internal DeploymentR&D Sabotage within Other High-Resource AI DevelopersDecision Sabotage within Major Governments
Much of the report focuses on considering in detail whether Opus 4 (and similar future models) could increase sabotage risk through any of these risk pathways. The report names mitigating factors for each risk pathway and estimates the strength of each mitigating factor. Anthropic also lays out the following three general claims that weigh against sabotage risks across the board:
Claim 1: Opus 4 is unreliable at highly-complex tasks, especially long-horizon agentic tasks requiring more than tens of minutes of autonomous action.
Claim 2: Opus 4, when taking on complex tasks, is not capable of reliably hiding its reasoning or otherwise covering its tracks, even when prompted to do so.
Claim 3: Opus 4 is unlikely to have dangerous goals or values that are stable across interactions.
Evidence Anthropic uses to support these claims:
- Monitoring Claude usage internally (in a number of different ways) and noting they didn't observe any major security threats. The most systematic monitoring was automated Claude Code usage monitoring for ~90% of employees and PR reviews (both manual and automated checks on each PR). They validated their monitoring with red-teaming.
- Since the monitoring wasn't comprehensive, it mostly serves to provide evidence that Claude doesn't have high propensity for writing sabotage code and that it wouldn't be able to execute sabotage super often without getting caught. It seems like very rare sabotage might slip through.
Other notable/interesting things about the report:
- "Our training process did apply some optimization pressure that likely disincentivized these models from revealing harmful information in their extended thinking text, and could have compromised faithfulness indirectly in other ways.""There had been internal miscommunications about this topic [whether Anthropic training might cause the CoT to not reveal when the model is pursuing harmful actions] that became clear only in response to questions from METR, and our work with METR helped accelerate some changes in our practices."
- External review caught something, and Anthropic took actions in response! The system works!
- In general I think this kind of redaction scheme seems like a great way to do risk reports without leaking sensitive information.
- The threat models where the evidence against them isn't that strong are (1) "Leaks that Sabotage the AI Developer" (discussed in an earlier bullet), (2) "Persistent Rogue Internal Deployment" (the main evidence against is lack of propensity and rogue internal deployments being too complicated for Claude to be able to execute), and (3) "Poisoning the Training Data of Future Models" (the main evidence against is lack of propensity and monitoring).
- ^
This isn't a rock-solid argument, especially once AIs become capable of automating AI R&D.
Discuss
