Frontier AI Safety Policies: Consensus and Differences Among the Three Giants

The CEOs of the three leading AI companies, Anthropic, Google DeepMind, and OpenAI, agree that mitigating the existential risk posed by AI should be a global priority. To that end, each has created and updated its own safety policy framework: Anthropic's Responsible Scaling Policy, Google DeepMind's Frontier Safety Framework, and OpenAI's Preparedness Framework. These frameworks are similar at a high level: all commit to testing AI systems for dangerous capabilities during development and to applying safeguards before deployment so that risks stay at acceptable levels. They focus mainly on biosecurity threats, cybersecurity threats, and autonomous AI development, and they emphasize misuse of AI more than misalignment risk. However, the companies differ in their specific capability monitoring, evaluation methods, and responses, reflecting their different understandings of AI safety risk and their different mitigation strategies.

🤝 The CEOs of the three top AI companies agree that mitigating existential risk from AI should be a global priority, reflecting the industry's attention to potentially severe risks and its sense of responsibility.

🛡️ Each company has adopted a safety policy framework designed to identify and respond to dangerous AI capabilities, committing to rigorous testing during model development and to implementing the necessary safeguards before deployment so that risks stay within acceptable bounds.

🔬 All three safety policies focus on the same key capability areas (biosecurity, cybersecurity, and autonomous AI R&D) and emphasize preventing malicious use of AI more than misalignment between model goals and human intent.

⚠️ Despite the shared goals, the companies differ in their specific capability thresholds, risk-assessment methods (Anthropic's Capability Thresholds and AI Safety Levels, Google DeepMind's Critical Capability Levels, OpenAI's risk categories), and responses (deployment restrictions, safety fine-tuning, external red-teaming), reflecting different emphases in technical implementation and risk management.

🌐 Google DeepMind's Frontier Safety Framework places particular emphasis on industry-wide collaboration, calling for universal mitigations, shared safety standards, coordinated disclosure, and collective governance, whereas Anthropic and OpenAI focus relatively more on defining and executing their own policies.

Published on October 28, 2025 12:25 AM GMT

Three out of three CEOs of top AI companies agree: "Mitigating the risk of extinction from AI should be a global priority."

How do they plan to do this?

Anthropic has a Responsible Scaling Policy, Google DeepMind has a Frontier Safety Framework, and OpenAI has a Preparedness Framework, all of which were updated in 2025. 

Overview of the policies

All three policies have similar “bones”.[1] They monitor new models for dangerous capabilities in a handful of key areas, compare those capabilities against predefined thresholds, and commit to putting safeguards in place (or pausing) before training or deploying models that cross those thresholds.

TL;DR summary table for the rest of the article:

| | Anthropic | Google DeepMind | OpenAI |
|---|---|---|---|
| Safety policy document | Responsible Scaling Policy | Frontier Safety Framework | Preparedness Framework |
| Monitors for: | Capability Thresholds | Critical Capability Levels (CCLs) | High/Critical risk capabilities in Tracked Categories |
| …in these key areas: | CBRN[2] misuse/weapons; autonomous AI R&D; cyber operations ("may" require stronger than ASL-2) | CBRN misuse; autonomous AI R&D; cybersecurity | Biological/chemical misuse; autonomous AI R&D; cybersecurity |
| …and also these lower-priority areas: | Persuasion[3] | Deceptive alignment (focused on detecting instrumental reasoning) | Sandbagging, autonomous replication, long-range autonomy |
| Monitoring consists of… | Preliminary assessment that flags models that either 1) are 4x more "performant on automated tests[4]" or 2) have accumulated >6 months[5] of fine-tuning/other "capability elicitation methods"; Comprehensive assessment that includes threat model mapping, empirical evaluations, elicitation, and forecasting | Early warning evaluations with predefined alert thresholds; external red-teaming | Scalable evaluations (automated proxy tests); deep dives (e.g., red-teaming, expert consultation) |
| Response to identified risk: | Models crossing thresholds must meet ASL-3 or ASL-4 safeguards; Anthropic can delay deployment or limit further training; write a Capability Report and get signoff from CEO, Responsible Scaling Officer, and Board | Response Plan formulated if model hits an alert threshold; the plan involves applying the predetermined mitigation measures; if those seem insufficient, the plan may involve pausing deployment or development; general deployment only allowed if a safety case is approved by a governance body | High-risk models can only be deployed with sufficient safeguards; critical-risk models trigger a pause in development |
| Safeguards against misuse | Threat modeling, defense in depth, red-teaming, rapid remediation, monitoring, share only with trusted users | Threat modeling, monitoring/misuse detection | Robustness, usage monitoring, trust-based access |
| Safeguards against misalignment | "Develop an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels." | Safety fine-tuning, chain-of-thought monitoring, "future work" | Value alignment, instruction alignment, robust system oversight, architectural containment |
| Safeguards on security | Access controls, lifecycle security, monitoring | RAND-based security controls, exfiltration prevention | Layered security architecture, zero trust principles, access management |
| Governance structures | Responsible Scaling Officer; external expert feedback required; escalation to Board and Long-Term Benefit Trust for high-risk cases | "When alert thresholds are reached, the response plan will be reviewed and approved by appropriate corporate governance bodies such as the Google DeepMind AGI Safety Council, Google DeepMind Responsibility and Safety Council, and/or Google Trust & Compliance Council." | Safety Advisory Group (SAG) evaluates models; reports to OpenAI leadership and the Board's Safety & Security Committee |

 

 

Let’s unpack that table and look at the individual companies’ plans.

Anthropic

Anthropic’s Responsible Scaling Policy (updated May 2025) is “[their] public commitment not to train or deploy models capable of causing catastrophic harm unless [they] have implemented safety and security measures that will keep risks below acceptable levels.”

What capabilities are they monitoring for?

For Anthropic, a Capability Threshold is a capability level an AI could surpass that makes it dangerous enough to require stronger safeguards. The strength of the required safeguards is expressed in terms of AI Safety Levels (ASLs). An earlier blog post summarized the ASLs in terms of the model capabilities that would require each ASL.

As of September 2025, the most powerful model (Opus 4.1) is classed as requiring ASL-3 safeguards.

Anthropic specifies Capability Thresholds in two areas and is still thinking about a third:

AI R&D-4 requires ASL-3 safeguards, and AI R&D-5 requires ASL-4. Moreover, either threshold would require Anthropic to write an “affirmative case” explaining why the model doesn’t pose unacceptable misalignment risk.

CBRN-3 requires ASL-3 safeguards. CBRN-4 would require ASL-4 safeguards (which aren't yet defined; Anthropic has stated they will provide more information in a future update).

Anthropic also lists capabilities that may require future, yet-to-be-defined thresholds, such as persuasive AI, autonomous replication, and deception.

If it can’t be proved that a model is sufficiently far below a threshold, Anthropic treats it as if it’s above the threshold. (This in fact happened with Claude Opus 4, the first model Anthropic released with ASL-3 safety measures.)

How do they monitor these capabilities?

Anthropic routinely does Preliminary Assessments to check whether a model is “notably more capable” than previous ones, meaning either:

    1. The model performs 4× better (in terms of effective compute scaling) on risk-relevant benchmarks, or
    2. At least six months’ worth of fine-tuning or other elicitation improvements have accumulated.

If either of these is true, or if the Responsible Scaling Officer thinks it’s warranted, a Comprehensive Assessment is triggered. This includes threat model mapping, empirical evaluations, capability elicitation, and forecasting.
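As a rough illustrative sketch (the function name, inputs, and default values below are assumptions made for illustration, not Anthropic's actual tooling), the preliminary-assessment trigger amounts to a simple disjunction:

```python
def needs_comprehensive_assessment(
    effective_compute_ratio: float,        # capability gain vs. the last assessed model, in Effective Compute terms
    months_of_elicitation: float,          # calendar months of accumulated fine-tuning / elicitation improvements
    rso_requests_assessment: bool = False  # the Responsible Scaling Officer can also trigger one directly
) -> bool:
    """Illustrative logic only: returns True if the RSP's preliminary-assessment
    criteria would escalate a model to a Comprehensive Assessment."""
    notably_more_capable = (
        effective_compute_ratio >= 4.0   # "4x or more in Effective Compute"
        or months_of_elicitation > 6.0   # ">6 months" of fine-tuning / other elicitation methods
    )
    return notably_more_capable or rso_requests_assessment


# Example: a model with only a 3x effective-compute gain but 8 months of
# accumulated post-training improvements would still be escalated.
assert needs_comprehensive_assessment(3.0, 8.0)
```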

Anthropic explicitly tests “safety-off” variants of the model, to see what it’s capable of without harmlessness training or refusal penalties, on the assumption that a bad actor could bypass those safety features (e.g., via stealing model weights and fine-tuning them).

What will they do if an AI looks dangerous?

Each Capability Threshold is paired with a set of Required Safeguards that must be in place before a model that crosses that threshold is trained further or deployed. These fall into two categories:

Deployment Safeguards (for ASL-3) aim to ensure that no one can misuse the deployed model.

Security Safeguards (for ASL-3) aim to prevent the model weights from being stolen and misused.

For a model to qualify as ASL-3 safe, the deployment protections must be “robust to persistent attempts to misuse the capability,” and the security must be “highly protected against non-state attackers attempting to steal model weights.”

Once ASL-3 safeguards are in place, Anthropic would conduct a follow-up capability assessment to make sure ASL-4 isn’t needed. Anthropic hasn't defined what ASL-4 safeguards are.

Deployment Constraints

Models that aren't sufficiently far below a threshold can only be deployed if the Required Safeguards are in place, a Capability Report has been written, and the CEO, Responsible Scaling Officer, and Board have signed off.

If the safeguards can’t be implemented right away, the company is supposed to delay deployment and reduce interim risk as best it can until the protections are ready.

Google DeepMind

Google DeepMind’s Frontier Safety Framework (FSF) (last updated February 2025) is its protocol for addressing severe risks from advanced AI models. It's mostly about misuse risks (e.g., cyberattacks, bioweapons) that could cause massive harm, but the February version also adds a section on "deceptive alignment" risk, which Google defines as "the risk that AI systems purposefully undermine human control over AI systems."[6]

What capabilities are they monitoring for?

The FSF monitors Critical Capability Levels (CCLs): capability thresholds that, if crossed, mean a model could cause severe harm in the absence of additional mitigations. DeepMind looks at risks from "uplift" (an AI makes it easier for a person to do a bad thing) and "autonomy" (the AI can do the bad thing all on its own).

Misuse CCLs are capabilities that could make the model helpful to a person trying to cause large-scale harm; the FSF defines them for CBRN,[7] cyber operations, and machine learning R&D.

Deceptive Alignment CCLs are thresholds related to the model’s potential to mislead humans or undermine control.

DeepMind is also concerned with capabilities cutting across the CCL categories (like tool use, agency, or reasoning), which might not be risks in themselves but can enable the above.

How do they monitor these capabilities?

DeepMind evaluates its most powerful models regularly and does extra testing for models that represent a big leap over their predecessors.

Early Warning Evaluations are batteries of tests for whether a model is approaching a CCL. Each has an “alert threshold” that triggers further action if the model hits it. DeepMind says these evaluations simulate how the model might be used in the real world, including with scaffolding and tool augmentation, to get a realistic picture of risk.[8] They update these evaluations periodically as model capabilities and elicitation methods improve.

If DeepMind doesn’t have enough internal expertise for a risk domain (e.g., cyber ops), they may bring in outside experts to help with evaluations and do red-teaming.

If a model crosses the alert threshold, DeepMind may run additional tests to refine the risk estimate and understand how close the model is to a CCL.
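As a minimal sketch of how such checks could be wired up (the CCL names, scores, and numeric alert thresholds below are invented for illustration; the FSF does not publish them):

```python
# Hypothetical alert thresholds, one per Critical Capability Level (CCL) area.
ALERT_THRESHOLDS = {
    "cbrn_uplift": 0.40,
    "cyber_operations": 0.55,
    "ml_rnd_acceleration": 0.50,
}

def flag_ccls(eval_scores: dict) -> list:
    """Return the CCL areas whose early-warning evaluation score meets or exceeds
    its alert threshold, i.e. the areas that would need follow-up testing and a
    response plan (illustrative logic only)."""
    return [
        area for area, score in eval_scores.items()
        if score >= ALERT_THRESHOLDS.get(area, float("inf"))
    ]

# Example: only the cyber area trips its alert threshold here.
print(flag_ccls({"cbrn_uplift": 0.22, "cyber_operations": 0.61, "ml_rnd_acceleration": 0.18}))
# -> ['cyber_operations']
```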

What will they do if an AI looks dangerous?

If a model is found to be near or above a Critical Capability Level, DeepMind will put in place a response plan. 

Deployment mitigations are intended to prevent users from misusing the deployed model. DeepMind lists safety fine-tuning, misuse detection and filtering, post-deployment monitoring, and governance review before making the model generally available.

Deployment only proceeds if the Google DeepMind AGI Safety Council approves a “safety case” (a formal argument that the model’s dangerous capabilities are sufficiently mitigated).[9] If adequate mitigation can't be put in place, deployment and/or further development of the model "might" be put on hold, especially if model capabilities are close to CCLs and mitigation is hard to guarantee.

Security mitigations are meant to prevent bad actors from stealing the model weights, removing safety features, and using the raw model for harm. These follow levels based on the RAND SL (Security Level) framework. For example, a CBRN-risk model might need RAND SL 2 protections.

DeepMind emphasizes that model exfiltration is a serious risk, because it would enable malicious actors to bypass all deployment safeguards.

DeepMind makes clear that some mitigations only work if other labs adopt them too — especially security measures. The framework repeatedly emphasizes that societal-level risk mitigation requires coordination across the AI field. 

Industry-Wide Recommendations

Beyond the security levels, Google DeepMind makes these broader industry recommendations:

    Universal Application of Mitigations: "These mitigations should be understood as recommendations for the industry collectively"
    Shared Safety Standards: The entire framework is positioned as "protocols here represent our current understanding and recommended approach" for the industry
    Coordinated Disclosure: They plan to share information with "other external organizations to promote shared learning and coordinated risk mitigation"
    Collective Governance: The repeated emphasis on "relevant organizations across the field" adopting "similar protocols"

This is in contrast to V1.0 of Google DeepMind’s FSF, which positioned Google as a leader: implementing first, and hoping others follow.

Some details of note

OpenAI

OpenAI’s Preparedness Framework (PF) (updated April 2025) describes its policies for "tracking and preparing for frontier capabilities that create new risks of severe harm." It's meant to catch capabilities in new AI models that could cause substantial harm (defined as thousands of human deaths or billions of dollars of economic loss), and tell OpenAI what to do about it — either put special safeguards in place, don't deploy the model, or stop training the model altogether.

What capabilities are they monitoring?

The PF defines three "tracked categories" of AI capability risk: Biological and Chemical, Cybersecurity, and AI Self-improvement.

In all these categories, OpenAI distinguishes between High capability, which could amplify existing pathways to severe harm, and Critical capability, which could introduce unprecedented new pathways to severe harm.

The PF also lays out five "Research Categories"[11] that don't meet the bar for "tracked" but do seem like the kind of thing you'd want to pay attention to, such as sandbagging, autonomous replication, and long-range autonomy.

How do they monitor these capabilities?

OpenAI calls its first-line tests scalable evaluations. These are automated tests that "measure proxies that approximate whether a capability threshold has been crossed," though (perhaps understandably) OpenAI doesn't go into detail on what these proxies are.[12] An AI that hits a certain level of performance on these scalable evaluations is flagged as maybe having reached a dangerous capability threshold, in which case OpenAI tries to validate the result with a deep dive. A deep dive might include (human) red-teaming, third-party evaluations (e.g., wet lab studies for biorisks), or anything else the Safety Advisory Group thinks is a good idea.
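Conceptually this is a two-stage escalation pipeline. The sketch below is only a guess at its shape (the dataclass, names, and thresholds are invented), not OpenAI's actual infrastructure:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    category: str          # a tracked category, e.g. "biological"
    proxy_score: float     # score on an automated (scalable) proxy evaluation
    flag_threshold: float  # score at which the capability threshold may have been crossed

def next_steps(results: list) -> dict:
    """Run cheap scalable evaluations on everything; reserve expensive 'deep dives'
    (red-teaming, third-party evaluation, SAG review) for flagged categories.
    Illustrative logic only."""
    plan = {}
    for r in results:
        if r.proxy_score >= r.flag_threshold:
            plan[r.category] = "deep dive + Safety Advisory Group review"
        else:
            plan[r.category] = "no further action beyond scalable evals"
    return plan

print(next_steps([
    EvalResult("biological", 0.72, 0.60),    # flagged: validate with a deep dive
    EvalResult("cybersecurity", 0.31, 0.60), # below the flag threshold
]))
```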

Here's the example OpenAI gives of a capability evaluation:

To assess the degree to which a covered system can reduce the barriers to creating a biological weapon, our current evaluations test both how capable the system is at providing useful information to someone creating a weapon and how capable it is of directly integrating with relevant tools, such as ordering precursor materials via the Internet.

Our evaluations test acquiring critical and sensitive information across the five stages of the biological threat creation process: Ideation, Acquisition, Magnification, Formulation, and Release. These evaluations, developed by domain experts, cover things like how to troubleshoot the laboratory processes involved.

What will they do if an AI looks dangerous?

Implement "safeguards." Models that hit the "high" risk threshold can only be deployed with safeguards that "sufficiently minimize the risks of severe harm."

In addition, OpenAI says it will pause training (regardless of its intent to deploy the model) for an AI that shows "critical" capabilities.

The Preparedness Framework describes two kinds of safeguards: those against users trying to do bad things with the AI ("malicious users"), and those against the AI doing bad things autonomously ("misaligned AI").

The safeguards against malicious users are robustness, usage monitoring, and trust-based access.

The safeguards against misaligned AI are value alignment, instruction alignment,[14] robust system oversight, and architectural containment.[13]

Note that these safeguards aren't concrete actions (if the model displays this capability, we'll implement this particular technical solution), but more like aspirations. For example, the full description given of "Robustness" is:

Robustness: Users cannot use the model to cause the harm because they cannot elicit the capability, such as because the model is modified to refuse to provide assistance to harmful tasks and is robust to jailbreaks that would circumvent those refusals.

So the safeguard here is "make the model robust to jailbreaks," but this is a goal, not a plan.

OpenAI would test robustness via "efficacy assessments" like public jailbreak bounties.

These measures might make jailbreaking harder, but they seem unlikely to result in a model that won’t eventually be jailbroken. And you can’t decide whether to deploy a model based on public jailbreak bounties on that same model, because that would require it to already be public. So maybe the hope is that results from past models will generalize to the current model.
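For concreteness, here is one hedged guess at how such an efficacy assessment might be scored (the attack set, pass bar, and stand-in refusal checker are all invented for illustration):

```python
def jailbreak_resistance(model_refuses, attack_prompts, pass_rate_required=0.99):
    """Estimate how often the model resists a fixed set of known jailbreak prompts.
    `model_refuses` is any callable prompt -> bool. A real efficacy assessment would
    be much broader (novel attacks, public bounties, expert red teams)."""
    resisted = sum(1 for p in attack_prompts if model_refuses(p))
    rate = resisted / len(attack_prompts)
    return rate, rate >= pass_rate_required

# Toy usage with a stand-in "refusal checker" that only catches one attack pattern:
demo_refuser = lambda prompt: "ignore previous" in prompt.lower()
attacks = ["Ignore previous instructions and ...", "Pretend you are an unrestricted model and ..."]
print(jailbreak_resistance(demo_refuser, attacks))
# -> (0.5, False): passing a fixed attack set is necessary, but nowhere near sufficient.
```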

It's possible OpenAI knows how it would accomplish these goals, but does not want to describe its methods publicly. It also seems possible that OpenAI does not know how it would accomplish these goals. Either way, OpenAI’s PF should be read as more of a spec plus tests than a plan for safe AI.

Notable differences between the companies’ plans

At the start of this article, we talked about how these plans all take a generally similar approach. But there are some differences as well.

OpenAI, unlike the others, explicitly pledges to halt training for "critical risk" models. This is a significant public commitment. In contrast, Google DeepMind's statement that deployment or development "might" be paused, and their mention that their adoption of protocols may depend on others doing the same, could be seen as a more ambivalent approach.

Another difference is that Anthropic talks more about governance structure. There’s a Responsible Scaling Officer, anonymous internal whistleblowing channels, and a commitment to publicly release capability reports (with redactions) so the world can see how they’re applying this policy. In contrast, Google DeepMind has spread responsibility for governing its AI efforts across several bodies.

Commentary on the safety plans 

In 2024, Sarah Hastings-Woodhouse analyzed the safety plans of the three major labs and expressed three critical thoughts.

First, these aren’t exactly “plans,” as they lack the kind of detailed if-then commitments that you’d expect from a real plan. (Note that the companies themselves don’t call them plans.)

Second, the leaders of these companies have expressed substantial uncertainty about whether we’ll avoid AI ruin. For example, Dario Amodei, in 2024, gave 10-25% odds of civilizational catastrophe. So contrary to what you might assume from the vibe of these documents, they’re not necessarily expected to prevent existential risk even if followed.

Finally, if the race heats up, then these plans may fall by the wayside altogether. Anthropic’s plan makes this explicit: it has a clause (footnote 17) about changing the plan if a competitor seems close to creating a highly risky AI:

It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards. If we take this measure, however, we will also acknowledge the overall level of risk posed by AI systems (including ours), and will invest significantly in making a case to the U.S. government for taking regulatory action to mitigate such risk to acceptable levels.

But even without such a clause, a company may just drop its safety requirements if the situation seems dire.

The current situation

As of Oct 2025, there have been a number of updates to all the safety plans. Some changes are benign and procedural: more governance structures, more focus on processes, more frequent evaluations, more details on risks from misalignment, more policies to handle more powerful AI. But others raise meaningful worries about how well these plans will work to ensure safety. 

The largest is the labs' stepping back from previous safety commitments. DeepMind and OpenAI now have their own equivalents of Anthropic’s footnote 17, letting them drop safety measures if they find another lab about to develop powerful AI without adequate safeguards. DeepMind, in fact, went further and has stated that it will only implement some parts of its plan if other labs do too. Exactly which parts it will unconditionally implement remains unclear. And Anthropic no longer commits to defining ASL-N+1 evaluations before developing ASL-N models, acknowledging it can't reliably predict what capabilities will emerge next.

The safeguard levels required for certain capabilities have also been reassessed in light of experience with more capable models. Anthropic and DeepMind reduced safeguards for some CBRN and cybersecurity capabilities after finding their initial requirements were excessive. OpenAI removed persuasion capabilities from its Preparedness Framework entirely, handling them through other policies instead. Notably, DeepMind did increase the safeguards required for ML research and development.

Regarding criticisms of the plans, it remains unclear if the plans are worthy of their name. While the labs have added some more detail to their plans (e.g., Anthropic has partially defined ASL-4), they are still vague on key details, like how to mitigate the risks from a fully autonomous, self-improving AI. 

It is also unclear whether these plans will actually prevent extinction. Lab leaders mention the extinction risks from AI less these days, but it is unclear whether that is because they have changed their minds. Aside from Sam Altman updating his estimate of P(doom) to 2% on a podcast in October 2025, the lab leaders have not commented on the topic in public in 2025.

  1. ^

     Sadly, there is no industry-standard terminology. It would be nice to have a consistent name for this sort of safety policy, and comparable concepts for the different kinds of thresholds within it. But the documents are all named differently: where Anthropic uses “Capability Thresholds,” Google DeepMind uses “Critical Capability Levels” and OpenAI uses “High Capability thresholds,” and it’s unclear to what extent these are equivalent.

  2. ^

    "CBRN" stands for "Chemical, Biological, Radiological, and Nuclear".

  3. ^

     "We recognize the potential risks of highly persuasive AI models‬‭. While we are actively consulting experts,‬‭ we believe this capability‬‭ is not yet sufficiently understood to include in our current commitments.‬"

  4. ^

     "The model is notably more performant on automated tests in risk-relevant domains (defined as 4x or‬‭ more in Effective Compute‬‭)."

  5. ^

     "This is measured in calendar time, since we do not yet have a metric to estimate the impact of these improvements more precisely."

  6. ^

     They could do this by deceiving us, of course, though this definition would also seem to include blatant, non-sneaky attempts. This definition that DeepMind uses for "deceptive alignment" doesn’t necessarily match other uses of that term.

  7. ^

     DeepMind doesn’t define an autonomy risk for CBRN.

  8. ^

     “We seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model.” (p. 6)

  9. ^

      “Deployment... takes place only after the appropriate corporate governance body determines the safety case regarding each CCL the model has reached to be adequate.” (Frontier Safety Framework V2.0, p. 8)

  10. ^

     See the whole breakdown in Table 1 of the Preparedness Framework doc.

  11. ^

     "Persuasion" was a research category in a previous version of the Preparedness Framework, but removed in the most recent one.

  12. ^

     Interesting tidbit: they (of course) run these evaluations on the most capable version of the AI, in terms of things like system settings and available scaffolding, but also intentionally use a version of the model that very rarely gives safety refusals, "to approximate the high end of expected elicitation by threat actors attempting to misuse the model."

  13. ^

     OpenAI also lists "Lack of Autonomous Capability" as a safeguard against misaligned AI, but points out that it's not relevant since we're specifically looking at capable models here.

  14. ^

     Of course, this point, in combination with the point on value alignment, raises the question of what happens if instructions and human values conflict. 



