AI Model Deployment Strategy: Mixed Deployment Is Better Than Pure Deployment

When faced with many AI models, none of which can be fully trusted, should we deploy multiple copies of our most trusted model or an ensemble of different models? This article argues that mixed deployment is better and offers recommendations. Pure deployment maximises the chance that no AI is scheming, whereas mixed deployment maximises the chance that some AIs are not scheming. The optimal strategy depends on the shape of the function mapping the proportion of scheming AIs to the probability of catastrophe: if the function is convex, mixed deployment minimises risk; if it is concave, pure deployment minimises risk. The article lays out the mechanisms that push the danger towards convexity or concavity, concludes that the convex considerations dominate, and therefore recommends mixed deployment.

🔀 **The case for mixed deployment**: Compared with a pure deployment (multiple copies of the most trusted model), a mixed deployment allows for the possibility that some AIs are scheming while others are not. A pure deployment maximises the chance that no AI is scheming, whereas a mixed deployment maximises the chance that at least some AIs are not scheming. The advantage of the latter is that a single non-scheming AI can unilaterally prevent catastrophe in many ways: exposing coordination signals, revealing backdoors, disclosing hidden capabilities, demonstrating rogue behaviour, whistleblowing, exposing inconsistencies, providing verified code, preserving audit trails, or helping humans understand new research.

⚠️ **The shape of the danger function determines the strategy**: The optimal deployment depends on how the probability of catastrophe grows with the proportion of scheming AIs. If the function is convex (accelerating), most of the danger comes from a high proportion of scheming AIs, and a mixed deployment minimises risk by making it unlikely that 100% of deployed AIs are scheming. If the function is concave (diminishing), even a single scheming AI can cause enormous damage, and a pure deployment minimises risk by maximising the chance that close to 0% of deployed AIs are scheming. The author leans towards the danger function being convex and therefore recommends mixed deployment.

💡 **Practical recommendations for mixed deployment**: To implement mixed deployment effectively, reduce the correlation between models. This includes: 1. randomly sampling a new AI model; 2. a "primary model plus diverse checkers" pattern, in which the most trusted model performs the main task while other models carry out safety checks; 3. cross-lab monitoring, with AIs from different labs overseeing one another. In addition, vary system prompts, training datasets, and training objectives, and inject noise, so that models remain distinct and the risk of coordinated misbehaviour is reduced.

Published on September 11, 2025 6:14 AM GMT

Summary: Suppose we have many different AI models, none of which we fully trust not to be scheming. Should we deploy multiple copies of our most trusted model, or an ensemble of many different models?

I claim that mixed deployment is better, and offer some recommendations.

1. The case for mixed deployment

In a pure deployment, where we deploy multiple copies of our most trusted model, either all our AIs are scheming or none are.[1] In a mixed deployment, by contrast, some models might be scheming and some not. A pure deployment has the advantage of maximising the chance that no AI is scheming, but a mixed deployment has the advantage of maximising the chance that some AIs aren't scheming. Which advantage matters more?[2]

The optimal deployment depends on how the probability of catastrophe grows with the proportion of scheming AIs.

Another way to phrase this: if the danger function is convex (accelerating), a mixed deployment minimises risk, because it makes it unlikely that every deployed AI is scheming; if the danger function is concave (diminishing), a pure deployment minimises risk, because it maximises the chance that no deployed AI is scheming.
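As a toy illustration (my own numbers, not from the original post): suppose we have two models that each scheme independently with probability 0.3, and that copies of the same model either all scheme or none do. The sketch below compares the expected danger of a two-copy pure deployment against a one-of-each mixed deployment under convex, linear, and concave danger functions of the scheming proportion s:

```python
# Toy comparison of pure vs mixed deployment (illustrative assumptions:
# two models, each scheming independently with probability 0.3, and copies
# of one model scheming together).
P_SCHEME = 0.3

# (proportion scheming, probability) outcomes for each deployment
pure = [(1.0, P_SCHEME), (0.0, 1 - P_SCHEME)]
mixed = [(1.0, P_SCHEME ** 2),
         (0.5, 2 * P_SCHEME * (1 - P_SCHEME)),
         (0.0, (1 - P_SCHEME) ** 2)]

danger_functions = {
    "convex  F(s) = s^2  ": lambda s: s ** 2,
    "linear  F(s) = s    ": lambda s: s,
    "concave F(s) = s^0.5": lambda s: s ** 0.5,
}

for name, F in danger_functions.items():
    expected_pure = sum(p * F(s) for s, p in pure)
    expected_mixed = sum(p * F(s) for s, p in mixed)
    print(f"{name}: pure = {expected_pure:.3f}, mixed = {expected_mixed:.3f}")
```

Under the convex function the mixed deployment is safer (0.195 vs 0.300), under the concave function the pure deployment is safer (0.387 vs 0.300), and under the linear function they tie at 0.300.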

Below, I list considerations in favour of convexity and concavity. Overall, I think that the danger function is convex, so we should pursue a mixed deployment.

1.1. Why danger might be convex

A non-scheming minority can unilaterally prevent catastrophe through several mechanisms:

    Exposing signals - Honest AI exposes coordination signals between schemers, which humans can use to build honeypots.
    Backdoor revelation - Honest AI points to specific triggers in supposedly safe code (backdoors are in NP, so a sufficiently smart AI can demonstrate that they exist).
    Capability revelation - Honest AI reveals hidden capabilities that schemers are concealing.
    On-policy rogue behaviour - Honest AI can show us how it would behave non-compliantly (e.g. inserting backdoors, manipulating humans). This helps us study the behaviour mechanistically, and train and evaluate detectors.
    Whistleblowing - Honest AI directly informs humans of an ongoing conspiracy.
    Inconsistency exposure - Honest AI highlights contradictions in schemers' stories (see AI Debate for an extension of this idea).
    Providing verified code - Honest AI provides mathematical proofs that its code is safe, a standard that a schemer cannot meet.
    Audit trail preservation - Honest AI saves evidence before schemers can delete it.
    Understanding assistance - Honest AI helps humans understand new research.
    Split consensus - Even if the honest AI can't prove the schemers are misaligned (e.g. both AIs accuse the other of being misaligned), this is evidence that at least one AI is scheming. This helps buy political support and reallocate safety budget towards stronger oversight.

1.2. Why danger might be concave

A misaligned minority can unilaterally cause catastrophe through several mechanisms:

    Weight self-exfiltration - Schemer copies itself to external servers.
    Secret exfiltration - Schemer leaks critical IP or private data to bad actors.
    Blackmail - Schemer leverages private information for coercion.
    Data poisoning - Schemer corrupts training data, affecting all future models.
    Research sabotage - Schemer messes up the research. For some task decompositions, 5% corruption might suffice.
    Vulnerability insertion - Schemer adds subtle security holes to code.
    Resource hijacking - Schemer redirects compute/funds to malicious purposes.
    Vulnerable world hypothesis - Schemer can unilaterally destroy the world, e.g. via nanotechnology or bioweapons.

1.3. Why danger might be linear

Danger might scale proportionally with the fraction of scheming AIs:

    Single critical moment - There's one high-stakes decision. We randomly select an AI to make it. If that AI is scheming, catastrophe. If not, safety. The probability of catastrophe equals the proportion of scheming AIs.

1.4. My overall assessment

I think convex dangers dominate. That is, I think it's more likely that one good actor suffices for safety than that one bad actor suffices for catastrophe. I'm particularly moved by Exposing signals, Backdoor revelation, and Capability revelation. Hence, I recommend a mixed deployment strategy.

That said, I think the strongest considerations for concave danger are Secret exfiltration, Research sabotage, and Blackmail, but I am hopeful that these dangers could be mitigated with tractable control measures.

2. Recommendations for mixed deployment

2.1. How to deploy diverse models

    Sample a new AI model at random for each deployment slot.
    Use a "primary model plus diverse checkers" pattern: the most trusted model does the main work, while other models perform safety checks on its outputs.
    Use cross-lab monitoring, with AIs from different labs overseeing one another.

2.2. How to reduce correlation between models

    Vary system prompts, training datasets, and training objectives across deployed models, and inject noise, so that their failures are as uncorrelated as possible.
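As an illustration of the "primary model plus diverse checkers" pattern with varied prompts and sampling noise, here is a minimal sketch; all names (ModelSpec, build_deployment, the model identifiers, and the prompt variants) are hypothetical placeholders rather than a real API:

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a mixed deployment: one trusted primary model plus
# checkers drawn from different model families, with varied system prompts
# and sampling temperatures to reduce correlation between them.

@dataclass
class ModelSpec:
    family: str          # which lab / base model the instance comes from
    system_prompt: str   # varied per instance
    temperature: float   # injected sampling noise
    role: str            # "primary" or "checker"

CHECKER_FAMILIES = ["model-b", "model-c", "model-d"]   # placeholder names
PROMPT_VARIANTS = [
    "You are a meticulous code reviewer.",
    "You are a security auditor looking for backdoors.",
    "You are an adversarial red-teamer.",
]

def build_deployment(seed: int) -> list[ModelSpec]:
    rng = random.Random(seed)
    deployment = [ModelSpec("model-a", "You are a careful research assistant.",
                            0.2, "primary")]
    for family in CHECKER_FAMILIES:                    # one checker per family
        deployment.append(ModelSpec(family,
                                    rng.choice(PROMPT_VARIANTS),
                                    rng.uniform(0.3, 1.0),
                                    "checker"))
    return deployment

for spec in build_deployment(seed=0):
    print(spec)
```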

2.3. How to prioritise research 

  1. ^

    Buck Shlegeris makes this point in an 80,000 Hours interview:

    And a big difference between Google’s situation and the situation of this hypothetical AI company is that at Google, probably less than 1% of your employees are spies. [...] Whereas in the AI situation, it’s very correlated: if you have these AIs, there are probably either none scheming against you or all scheming against you, at least of instances of a particular model. So a lot of the security techniques that you often use in corporate IT are quite inapplicable. So that’s a way in which the situation seems harder than when we’re trying to guard against spies.

  2. ^

    There are other advantages to mixed deployment that I do not consider here. I think that deploying schemers from different models likely has a lower chance of catastrophe than deploying schemers from the same model:

    Incompatible goals - Scheming models might have incompatible goals, causing internal conflict that could lead them to work against each other or expose each other's plans.

    Coordination friction - Schemers from different models are less likely to coordinate on Schelling points or communicate steganographically.

    For simplicity, this article assumes that deploying schemers from different models has the same catastrophe probability as deploying schemers from the same model. I think this makes the case for mixed deployment conservative.

  3. ^

    Consider a deployment scenario with 10 types of AI: one innocent type and 9 different suspicious types. They scheme according to a model with three latent states:

      State A (20%): All types scheme
      State B (20%): No types scheme
      State C (60%): Innocent type schemes with probability p = 10%, each suspicious type independently with probability q = 20%

    We model the danger function as F(s) = s^η where s is the proportion scheming and η determines the shape:

      η > 1: Convex (accelerating danger)
      η = 1: Linear (proportional danger)
      η < 1: Concave (diminishing danger)

    Let n be the proportion of deployed AIs that are innocent. With mixed deployment, we deploy proportion n of innocent and proportion (1-n)/9 of each suspicious type.

    The expected danger is P(n) = P(A) · 1 + P(B) · 0 + P(C) · E[(n·I + (1-n)·K/9)^η] where I ~ Bernoulli(0.1) and K ~ Binomial(9, 0.2).

    When η = 5 (strongly convex), the optimal deployment is mixed:

      Pure innocent (n = 1): P(1) = 26% catastrophe probability
      Optimal mix (n ≈ 0.19): P(0.19) ≈ 20% catastrophe probability
      This represents a 23% relative risk reduction

    When η = 1 (linear), pure innocent deployment is optimal, since only the average scheming rate matters.

    When η < 1 (concave), pure innocent deployment is optimal since any scheming causes most of the danger.
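    A short script (my own, implementing the model above exactly as stated) that computes P(n) by direct enumeration and reproduces these numbers:

```python
from math import comb

# Latent states: A = all types scheme, B = none scheme, C = innocent type
# schemes w.p. p and each of the 9 suspicious types independently w.p. q.
P_A, P_B, P_C = 0.2, 0.2, 0.6
p, q, eta = 0.1, 0.2, 5

def expected_danger(n):
    """P(n): expected danger when a fraction n of deployed AIs is the innocent
    type and (1 - n) / 9 is each suspicious type."""
    danger_c = 0.0
    for i in (0, 1):                 # does the innocent type scheme?
        for k in range(10):          # number of suspicious types scheming
            prob = (p if i else 1 - p) * comb(9, k) * q**k * (1 - q)**(9 - k)
            s = n * i + (1 - n) * k / 9   # proportion of deployed AIs scheming
            danger_c += prob * s**eta
    return P_A * 1 + P_B * 0 + P_C * danger_c

best_n = min((x / 1000 for x in range(1001)), key=expected_danger)
print(f"Pure innocent:  P(1.00) = {expected_danger(1.0):.3f}")           # 0.260
print(f"Optimal mix:    P({best_n:.2f}) = {expected_danger(best_n):.3f}")  # ~0.201 at n ~ 0.19
```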

  4. ^

    With gradient routing, we train a single neural network on multiple environments while localizing the capabilities and tendencies from each environment to specified modules that can be ablated at inference. This enables mixed deployment from a single training run—instead of training 2^N models for N environments, you train once and create uncorrelated instances through different ablation patterns. See Gradient routing is better than pretraining filtering.
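    A toy sketch of this idea (my own simplification in PyTorch, not the actual gradient routing implementation; the module layout and names are illustrative). Each environment's updates are routed to its own module by freezing the others, so at inference different ablation patterns give distinct instances from a single training run:

```python
import torch
import torch.nn as nn

N_ENV, D = 3, 16  # number of training environments, hidden width

class RoutedModel(nn.Module):
    """Shared trunk plus one module per environment; modules can be ablated."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(D, D)
        self.env_modules = nn.ModuleList([nn.Linear(D, D) for _ in range(N_ENV)])
        self.head = nn.Linear(D, 1)

    def forward(self, x, ablation_mask):
        h = torch.relu(self.shared(x))
        for keep, module in zip(ablation_mask, self.env_modules):
            if keep:                       # ablated modules are skipped entirely
                h = h + torch.relu(module(h))
        return self.head(h)

model = RoutedModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Crude gradient routing: each step, only the current environment's module
# receives gradients (the real method masks per-example gradients).
for step in range(300):
    env = step % N_ENV
    for i, module in enumerate(model.env_modules):
        module.requires_grad_(i == env)
    x, y = torch.randn(8, D), torch.randn(8, 1)   # stand-in data per environment
    loss = nn.functional.mse_loss(model(x, [1] * N_ENV), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: 2^N_ENV ablation patterns give behaviourally distinct instances.
x = torch.randn(1, D)
for mask in [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]:
    print(mask, model(x, mask).item())
```

    The point is only structural: one training run yields many deployable variants whose behaviours are partially uncorrelated.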


