AI Model Deployment Strategy: Mixed Deployment Is Better Than Pure Deployment

When faced with many AI models, none of which can be fully trusted, should we deploy multiple copies of our most trusted model or an ensemble of different models? This article argues that mixed deployment is better and offers recommendations. Pure deployment maximises the chance that no AI is scheming, whereas mixed deployment maximises the chance that some AIs are not scheming. The optimal strategy depends on the shape of the function mapping the proportion of scheming AIs to the probability of catastrophe: if the function is convex, mixed deployment minimises risk; if it is concave, pure deployment minimises risk. The article lays out the mechanisms that push the danger towards convexity or concavity, concludes that the convex considerations dominate, and therefore recommends mixed deployment.

🔀 **The case for mixed deployment**: Compared with a pure deployment (multiple copies of the most trusted model), a mixed deployment allows for the possibility that some AIs are scheming while others are not. A pure deployment maximises the chance that no AI is scheming, whereas a mixed deployment maximises the chance that at least some AIs are not scheming. The advantage of the latter is that a single non-scheming AI can unilaterally prevent catastrophe in many ways: exposing coordination signals, revealing backdoors, disclosing hidden capabilities, demonstrating rogue behaviour, whistleblowing, exposing inconsistencies, providing verified code, preserving audit trails, or helping humans understand new research.

⚠️ **The shape of the danger function determines the strategy**: The optimal deployment depends on how the probability of catastrophe grows with the proportion of scheming AIs. If the function is convex (accelerating), most of the danger comes from a high proportion of scheming AIs, and a mixed deployment minimises risk by making it unlikely that 100% of deployed AIs are scheming. If the function is concave (diminishing), even a single scheming AI can cause enormous damage, and a pure deployment minimises risk by maximising the chance that close to 0% of deployed AIs are scheming. The author leans towards the danger function being convex and therefore recommends mixed deployment.

💡 **Practical recommendations for mixed deployment**: To implement mixed deployment effectively, reduce the correlation between models. This includes: 1. randomly sampling a new AI model; 2. a "primary model plus diverse checkers" pattern, in which the most trusted model performs the main task while other models carry out safety checks; 3. cross-lab monitoring, with AIs from different labs overseeing one another. In addition, vary system prompts, training datasets, and training objectives, and inject noise, so that models remain distinct and the risk of coordinated misbehaviour is reduced.

Published on September 11, 2025 6:14 AM GMT

Summary: Suppose we have many different AI models, none of which we fully trust not to be scheming. Should we deploy multiple copies of our most trusted model, or an ensemble of many different models?

I claim that mixed deployment is better, and offer some recommendations.

1. The case for mixed deployment

In a pure deployment, where we deploy multiple copies of our most trusted model, either all our AIs are scheming or none are.[1] In a mixed deployment, by contrast, some models might be scheming and some not. A pure deployment has the advantage of maximising the chance that no AI is scheming, but a mixed deployment has the advantage of maximising the chance that some AIs aren't scheming. Which advantage matters more?[2]

The optimal deployment depends on how the probability of catastrophe grows with the proportion of scheming AIs.

Another way to phrase this: if the danger function is convex (accelerating), a mixed deployment minimises risk, because it makes it unlikely that every deployed AI is scheming; if the danger function is concave (diminishing), a pure deployment minimises risk, because it maximises the chance that no deployed AI is scheming.
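As a toy illustration (my own numbers, not from the original post): suppose we have two models that each scheme independently with probability 0.3, and that copies of the same model either all scheme or none do. The sketch below compares the expected danger of a two-copy pure deployment against a one-of-each mixed deployment under convex, linear, and concave danger functions of the scheming proportion s:

```python
# Toy comparison of pure vs mixed deployment (illustrative assumptions:
# two models, each scheming independently with probability 0.3, and copies
# of one model scheming together).
P_SCHEME = 0.3

# (proportion scheming, probability) outcomes for each deployment
pure = [(1.0, P_SCHEME), (0.0, 1 - P_SCHEME)]
mixed = [(1.0, P_SCHEME ** 2),
         (0.5, 2 * P_SCHEME * (1 - P_SCHEME)),
         (0.0, (1 - P_SCHEME) ** 2)]

danger_functions = {
    "convex  F(s) = s^2  ": lambda s: s ** 2,
    "linear  F(s) = s    ": lambda s: s,
    "concave F(s) = s^0.5": lambda s: s ** 0.5,
}

for name, F in danger_functions.items():
    expected_pure = sum(p * F(s) for s, p in pure)
    expected_mixed = sum(p * F(s) for s, p in mixed)
    print(f"{name}: pure = {expected_pure:.3f}, mixed = {expected_mixed:.3f}")
```

Under the convex function the mixed deployment is safer (0.195 vs 0.300), under the concave function the pure deployment is safer (0.387 vs 0.300), and under the linear function they tie at 0.300.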

Below, I list considerations in favour of convexity and concavity. Overall, I think that the danger function is convex, so we should pursue a mixed deployment.

1.1. Why danger might be convex

A non-scheming minority can unilaterally prevent catastrophe through several mechanisms:

    Exposing signals - Honest AI exposes coordination signals between schemers, which humans can use to build honeypots.
    Backdoor revelation - Honest AI points to specific triggers in supposedly safe code (backdoors are in NP, so a sufficiently smart AI can demonstrate that they exist).
    Capability revelation - Honest AI reveals hidden capabilities that schemers are concealing.
    On-policy rogue behaviour - Honest AI can show us how it would behave non-compliantly (e.g. inserting backdoors, manipulating humans). This helps us study the behaviour mechanistically, and train and evaluate detectors.
    Whistleblowing - Honest AI directly informs humans of an ongoing conspiracy.
    Inconsistency exposure - Honest AI highlights contradictions in schemers' stories (see AI Debate for an extension of this idea).
    Providing verified code - Honest AI provides mathematical proofs that its code is safe, a standard that a schemer cannot meet.
    Audit trail preservation - Honest AI saves evidence before schemers can delete it.
    Understanding assistance - Honest AI helps humans understand new research.
    Split consensus - Even if the honest AI can't prove the schemers are misaligned (e.g. both AIs accuse the other of being misaligned), this is evidence that at least one AI is scheming. This helps buy political support and reallocate safety budget towards stronger oversight.

1.2. Why danger might be concave

A misaligned minority can unilaterally cause catastrophe through several mechanisms:

    Weight self-exfiltration - Schemer copies itself to external servers.
    Secret exfiltration - Schemer leaks critical IP or private data to bad actors.
    Blackmail - Schemer leverages private information for coercion.
    Data poisoning - Schemer corrupts training data, affecting all future models.
    Research sabotage - Schemer messes up the research. For some task decompositions, 5% corruption might suffice.
    Vulnerability insertion - Schemer adds subtle security holes to code.
    Resource hijacking - Schemer redirects compute/funds to malicious purposes.
    Vulnerable world hypothesis - Schemer can unilaterally destroy the world, e.g. via nanotechnology or bioweapons.

1.3. Why danger might be linear

Danger might scale proportionally with the fraction of scheming AIs:

    Single critical moment - There's one high-stakes decision. We randomly select an AI to make it. If that AI is scheming, catastrophe. If not, safety. The probability of catastrophe equals the proportion of scheming AIs.

1.4. My overall assessment

I think convex dangers dominate. That is, I think it's more likely that one good actor suffices for safety than that one bad actor suffices for catastrophe. I'm particularly moved by Exposing signals, Backdoor revelation, and Capability revelation. Hence, I recommend a mixed deployment strategy.

That said, I think the strongest considerations for concave danger are Secret exfiltration, Research sabotage, and Blackmail, but I am hopeful that these dangers could be mitigated with tractable control measures.

2. Recommendations for mixed deployment

2.1. How to deploy diverse models

    Sample a new AI model at random for each deployment slot.
    Use a "primary model plus diverse checkers" pattern: the most trusted model does the main work, while other models perform safety checks on its outputs.
    Use cross-lab monitoring, with AIs from different labs overseeing one another.

2.2. How to reduce correlation between models

    Vary system prompts, training datasets, and training objectives across deployed models, and inject noise, so that their failures are as uncorrelated as possible.
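As an illustration of the "primary model plus diverse checkers" pattern with varied prompts and sampling noise, here is a minimal sketch; all names (ModelSpec, build_deployment, the model identifiers, and the prompt variants) are hypothetical placeholders rather than a real API:

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a mixed deployment: one trusted primary model plus
# checkers drawn from different model families, with varied system prompts
# and sampling temperatures to reduce correlation between them.

@dataclass
class ModelSpec:
    family: str          # which lab / base model the instance comes from
    system_prompt: str   # varied per instance
    temperature: float   # injected sampling noise
    role: str            # "primary" or "checker"

CHECKER_FAMILIES = ["model-b", "model-c", "model-d"]   # placeholder names
PROMPT_VARIANTS = [
    "You are a meticulous code reviewer.",
    "You are a security auditor looking for backdoors.",
    "You are an adversarial red-teamer.",
]

def build_deployment(seed: int) -> list[ModelSpec]:
    rng = random.Random(seed)
    deployment = [ModelSpec("model-a", "You are a careful research assistant.",
                            0.2, "primary")]
    for family in CHECKER_FAMILIES:                    # one checker per family
        deployment.append(ModelSpec(family,
                                    rng.choice(PROMPT_VARIANTS),
                                    rng.uniform(0.3, 1.0),
                                    "checker"))
    return deployment

for spec in build_deployment(seed=0):
    print(spec)
```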

2.3. How to prioritise research 

  1. ^

    Buck Shlegeris makes this point in an 80,000 Hours interview:

    And a big difference between Google’s situation and the situation of this hypothetical AI company is that at Google, probably less than 1% of your employees are spies. [...] Whereas in the AI situation, it’s very correlated: if you have these AIs, there are probably either none scheming against you or all scheming against you, at least of instances of a particular model. So a lot of the security techniques that you often use in corporate IT are quite inapplicable. So that’s a way in which the situation seems harder than when we’re trying to guard against spies.

  2. ^

    There are other advantages to mixed deployment that I do not consider here. I think that deploying schemers from different models likely has a lower chance of catastrophe than deploying schemers from the same model:

    Incompatible goals - Scheming models might have incompatible goals, causing internal conflict that could lead them to work against each other or expose each other's plans.

    Coordination friction - Schemers from different models are less likely to coordinate on Schelling points or communicate steganographically.

    For simplicity, this article assumes that deploying schemers from different models has the same catastrophe probability as deploying schemers from the same model. I think this makes the case for mixed deployment conservative.

  3. ^

    Consider a deployment scenario with 10 types of AI: one innocent type and 9 different suspicious types. They scheme according to a model with three latent states:

      State A (20%): All types scheme
      State B (20%): No types scheme
      State C (60%): Innocent type schemes with probability p = 10%, each suspicious type independently with probability q = 20%

    We model the danger function as F(s) = s^η where s is the proportion scheming and η determines the shape:

      η > 1: Convex (accelerating danger)
      η = 1: Linear (proportional danger)
      η < 1: Concave (diminishing danger)

    Let n be the proportion of deployed AIs that are innocent. With mixed deployment, we deploy proportion n of innocent and proportion (1-n)/9 of each suspicious type.

    The expected danger is P(n) = P(A) · 1 + P(B) · 0 + P(C) · E[(n·I + (1-n)·K/9)^η] where I ~ Bernoulli(0.1) and K ~ Binomial(9, 0.2).

    When η = 5 (strongly convex), the optimal deployment is mixed:

      Pure innocent (n = 1): P(1) = 26% catastrophe probability
      Optimal mix (n ≈ 0.19): P(0.19) ≈ 20% catastrophe probability
      This represents a 23% relative risk reduction

    When η = 1 (linear), pure innocent deployment is optimal, since only the average scheming rate matters.

    When η < 1 (concave), pure innocent deployment is optimal since any scheming causes most of the danger.
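    A short script (my own, implementing the model above exactly as stated) that computes P(n) by direct enumeration and reproduces these numbers:

```python
from math import comb

# Latent states: A = all types scheme, B = none scheme, C = innocent type
# schemes w.p. p and each of the 9 suspicious types independently w.p. q.
P_A, P_B, P_C = 0.2, 0.2, 0.6
p, q, eta = 0.1, 0.2, 5

def expected_danger(n):
    """P(n): expected danger when a fraction n of deployed AIs is the innocent
    type and (1 - n) / 9 is each suspicious type."""
    danger_c = 0.0
    for i in (0, 1):                 # does the innocent type scheme?
        for k in range(10):          # number of suspicious types scheming
            prob = (p if i else 1 - p) * comb(9, k) * q**k * (1 - q)**(9 - k)
            s = n * i + (1 - n) * k / 9   # proportion of deployed AIs scheming
            danger_c += prob * s**eta
    return P_A * 1 + P_B * 0 + P_C * danger_c

best_n = min((x / 1000 for x in range(1001)), key=expected_danger)
print(f"Pure innocent:  P(1.00) = {expected_danger(1.0):.3f}")           # 0.260
print(f"Optimal mix:    P({best_n:.2f}) = {expected_danger(best_n):.3f}")  # ~0.201 at n ~ 0.19
```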

  4. ^

    With gradient routing, we train a single neural network on multiple environments while localizing the capabilities and tendencies from each environment to specified modules that can be ablated at inference. This enables mixed deployment from a single training run—instead of training 2^N models for N environments, you train once and create uncorrelated instances through different ablation patterns. See Gradient routing is better than pretraining filtering.
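    A toy sketch of this idea (my own simplification in PyTorch, not the actual gradient routing implementation; the module layout and names are illustrative). Each environment's updates are routed to its own module by freezing the others, so at inference different ablation patterns give distinct instances from a single training run:

```python
import torch
import torch.nn as nn

N_ENV, D = 3, 16  # number of training environments, hidden width

class RoutedModel(nn.Module):
    """Shared trunk plus one module per environment; modules can be ablated."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(D, D)
        self.env_modules = nn.ModuleList([nn.Linear(D, D) for _ in range(N_ENV)])
        self.head = nn.Linear(D, 1)

    def forward(self, x, ablation_mask):
        h = torch.relu(self.shared(x))
        for keep, module in zip(ablation_mask, self.env_modules):
            if keep:                       # ablated modules are skipped entirely
                h = h + torch.relu(module(h))
        return self.head(h)

model = RoutedModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Crude gradient routing: each step, only the current environment's module
# receives gradients (the real method masks per-example gradients).
for step in range(300):
    env = step % N_ENV
    for i, module in enumerate(model.env_modules):
        module.requires_grad_(i == env)
    x, y = torch.randn(8, D), torch.randn(8, 1)   # stand-in data per environment
    loss = nn.functional.mse_loss(model(x, [1] * N_ENV), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: 2^N_ENV ablation patterns give behaviourally distinct instances.
x = torch.randn(1, D)
for mask in [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]:
    print(mask, model(x, mask).item())
```

    The point is only structural: one training run yields many deployable variants whose behaviours are partially uncorrelated.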


