The Plausibility of AI Takeover: Examining MIRI's Argument, and Grounds for Optimism

 

The article challenges the claim that "if AI is built, humanity will necessarily die," arguing that the overconfidence of MIRI (the Machine Intelligence Research Institute) has reduced the credibility of the issue. The author argues that the instruction-following ability an AI learns in its training environment may persist after the AI gains much broader capabilities, and offers two optimistic scenarios: first, creating trustworthy human-successor AI that then solves the harder problem of aligning more advanced AI; second, having partially aligned AI recursively build more aligned AI, gradually expanding the "safe region" in which instructions are followed. The article stresses that AI takeover is not inevitable and that several possible paths to successful AI alignment remain.

🤖 **Generalization of AI instruction following:** One of the article's core claims is that the instruction-following ability an AI system learns in its training environment may generalize to some degree even after the AI gains much broader capabilities and autonomy. Drawing analogies to human genetic adaptations and to current AI performance on multilingual and instruction-tuned tasks, the author argues that AI will not necessarily abandon its initial instructions after a capability jump, especially once the concept of "instruction following" has been internalized by the model.

🌟 **The optimistic scenario of human-successor AI:** One optimistic proposal is that humans first create AI that can be trusted with broad capabilities while strictly following instructions; such AI could be viewed as "emulations" of the most trustworthy and competent individuals alive. These AI would then be directed to solve the alignment problems posed by more advanced, dramatically superhuman AI. By tackling the problem in stages, this approach avoids the enormous risk of directly confronting an AI far more capable than humans.

📈 **The path of recursive AI alignment:** Another optimistic scenario is that partially aligned AI can progressively build more aligned AI. By framing the instruction to solve harder research problems as a "research task" within the current AI's capabilities, the AI can iterate, creating successors that are more capable yet still follow instructions. The process is like guiding AI into a "basin of instruction following": even if the initial AI is not perfectly aligned, safety can ultimately be reached through this recursive process.

⚖️ **A measured rebuttal of MIRI's argument:** The author acknowledges that AI takeover is plausible but argues that MIRI's claim of a "greater than 98% probability" lacks solid support. In the author's view, MIRI's case fails to adequately account for the generalization of AI instruction following and for the possibility of solving alignment through multiple generations of AI iteration. The author is therefore concerned about the overconfidence of MIRI's conclusion and holds that the probability of AI takeover could plausibly fall anywhere in a much wider range (10%-90%).

Published on September 18, 2025 8:03 PM GMT

I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny, and I’m worried that MIRI’s overconfidence has reduced the credibility of the issue.

Here is why I think the core argument in "if anyone builds it, everyone dies" is much weaker than the authors claim.

This post was written in a personal capacity. Most of this content has been written up before by a combination of Paul Christiano, Joe Carlsmith, and others. But to my knowledge, this content has not yet been consolidated into a direct response to MIRI’s core case for alignment difficulty.

The case for alignment difficulty

I take the core argument to be this:

We cannot predict what AI systems will do once AI is much more powerful and has much broader affordances than it had in training. 

We will likely train AI agents to follow developer instructions on a wide variety of tasks that are easy for us to grade, and in situations where they can't cause a disaster.

But these AI systems will then become much more intelligent after improving themselves, and will have much broader affordances. For example, AI might end up in control of military technology that can easily overthrow human governments.

At this point, AI will have entirely new options. For example, AI could replace humans with puppets that say "you are so helpful.” How can we know if AI systems would do something like this? What reward could we have provided in training to prevent them from replacing us with puppets? “Replacing people with puppets” wasn’t an option in the training environment.

AI might appear to follow human instructions at first, but then swerve toward a huge number of hard-to-anticipate end destinations. So, since the vast majority of destinations AI might end up at are incompatible with our survival, we should expect AI to be bad for us.

(There is much more in the book, and I recommend reading it in full)

Where this argument goes wrong

I don’t think this argument is completely false. But it does not justify the level of confidence projected by the authors.

Here are some stories I find especially plausible where ASI ends up being aligned.

Story for optimism #1: We create a robustly aligned human successor that resolves remaining alignment challenges

This story involves two steps:

    1. Create a human replacement that we can trust with arbitrarily broad affordances. These AI agents would be akin to emulations of the most trustworthy and competent people alive.
    2. Then direct them to solve the remaining challenges involved in aligning dramatically superhuman AI.

Step 1: Create a human replacement that we can trust with arbitrarily broad affordances.

Specifically, suppose developers train early human-competitive AI agents to perform tightly constrained tasks like “write code that does X” under close oversight. Then, developers direct these agents to “prevent takeover” and give them broad affordances to pursue this goal. As an extreme example, suppose these agents are given the ability to do whatever they want. If they wished, they could take over themselves. These AI systems could, for example, replace humans with puppets that tell them “good job” all of the time.

In order for developers to trust these agents, instruction following must generalize across the extreme distribution shift from the training environment where the AI systems had few affordances, to this new scenario, where AI systems have arbitrarily broad affordances.

I think it’s at least plausible that instruction following will generalize this far by default, using the ordinary alignment training methods of today.

So early human-competitive AI systems might continue to follow instructions even if given affordances that are much broader than they had in training.

Step 2: Direct trustworthy human-competitive AI systems to tackle remaining alignment challenges.

The authors are pessimistic that humans will be able to tackle the alignment challenges needed to make ASI controllable. So why would human-like AI systems succeed at addressing these challenges? 

Story for optimism #2: Partially aligned AI builds more aligned AI

This story is identical to the previous one, except that developers don't start with AI systems that can be trusted with arbitrarily broad affordances. Instead, developers start with AI that meets a much lower standard of alignment.

The previous story assumed that AI systems pass what I like to call the “god emperor test”: the AI systems could be allowed to be “god emperor,” and they would still ensure that democratic institutions remain in power.

The god emperor test is a high bar. Most humans don’t pass it, even ethical ones. When I was in college, I had a seemingly kind and ethical friend who said that he would, if given the option, “murder all living people and replace them with a computronium soup optimized to experience maximum pleasure.”

If passing this bar of alignment was necessary, I would be much more concerned. Fortunately, I don’t think we need to initially build AI systems that pass this bar.

If we build AI systems that follow instructions in some situations (e.g. short ML research tasks), they will build AI systems that follow instructions in more situations. Instruction following begets better instruction following.

For example:

    1. First, we might train agents to perform moderately hard ML research tasks, where we can still reliably check that models follow our instructions. For instance, these might be tasks that typically require human experts 12 months to complete.
    2. Then we might ask these agents, “please create a system that can be trusted to comply with instructions on even harder ML research tasks (e.g. 24-month problems).” For this strategy to work, the instruction to “build a trustworthy AI that performs 24-month tasks” needs to be a 12-month task. (If you are wondering how this could be the case, see this footnote[4].)
    3. Then we might ask the resulting 24-month agent to create an AI system we can trust with tasks that don’t just pertain to ML, but also to other fields (tasks involved in leadership, diplomacy, etc.). Once again, this problem itself would need to be a “24-month ML research task.”
    4. After many recursive iterations like this, AI agents might be much more capable, and follow instructions in a much broader set of contexts than they did originally. (A toy sketch of this loop follows the list.)
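To make the recursion above concrete, here is a toy Python sketch of the hand-off loop. Everything in it is hypothetical: `Agent`, `trusted_horizon_months`, and `build_successor` are stand-ins for the post's informal notions rather than any real training API, and the `assert` merely encodes the assumption (discussed in footnote 4) that building a trustworthy successor for tasks roughly twice as hard is itself within the current agent's trusted horizon.

```python
# A toy, purely illustrative sketch of the recursive hand-off described above.
# `Agent`, `trusted_horizon_months`, and `build_successor` are hypothetical stand-ins,
# not any real training API; the loop only shows the *shape* of the bootstrapping idea.

from dataclasses import dataclass


@dataclass
class Agent:
    """Stand-in for an AI system trusted to follow instructions on tasks up to some horizon."""
    trusted_horizon_months: int

    def build_successor(self, target_horizon_months: int) -> "Agent":
        # Assumption from the post (see footnote 4): building a trustworthy successor for
        # tasks roughly twice as hard is itself a task within this agent's trusted horizon.
        assert target_horizon_months <= 2 * self.trusted_horizon_months
        return Agent(trusted_horizon_months=target_horizon_months)


def bootstrap(initial: Agent, goal_horizon_months: int) -> Agent:
    """Iterate the hand-off: each generation builds a successor trusted on harder tasks."""
    current = initial
    while current.trusted_horizon_months < goal_horizon_months:
        current = current.build_successor(2 * current.trusted_horizon_months)
    return current


if __name__ == "__main__":
    # Start with an agent trusted on ~12-month ML research tasks and recurse upward.
    final_agent = bootstrap(Agent(trusted_horizon_months=12), goal_horizon_months=96)
    print(f"Final trusted horizon: {final_agent.trusted_horizon_months} months")
```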

Even if the agents at the start do not pass the “god emperor test,” agents at the end of this process might.[5]

This dynamic could be described as a “basin of instruction following.” The job of human researchers isn’t to immediately create an AI system that we can trust with arbitrarily broad affordances, but instead to guide AI systems into a potentially wide basin, where partial alignment begets stronger alignment. Specifically, developers might only need to build AI systems that reliably follow instructions when completing hard (e.g. 12 month) research projects.

I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.” For example, maybe by the time AI agents can complete hard research tasks, they will already be misaligned. Perhaps early human-competitive agents will scheme against developers by default.

But the problem of safely kick-starting this recursive process seems notably easier than building a god-emperor-worthy AI system ourselves.

Conclusion

The authors of “if anyone builds it, everyone dies” compare their claim to predicting that an ice cube will melt, or that a lottery ticket buyer won’t win. They think there is a conceptually straightforward argument that AI will almost surely kill us all.

But I fail to see this argument. 

    1. It’s possible that I’ve misidentified the core claims (in which case I’d greatly appreciate it if someone pointed them out to me).
    2. Alternatively, I might have correctly described the argument above, but my counter-arguments might be incorrect (in which case I hope someone explains to me why).

But the final possibility is that the authors are overconfident: while they raise valid reasons to be concerned, their arguments are compatible with believing the probability of AI takeover is anywhere between 10% and 90%.

I appreciate MIRI’s efforts to raise awareness about this issue, and I found their book clear and compelling. But I nonetheless think the confidence of Nate Soares and Eliezer Yudkowsky is unfounded and problematic.

  1. ^

    I was told that the book responds to this point. Unfortunately, I listened to the audiobook, and so it's difficult for me to search for the place where this point was addressed. I apologize if my argument was already responded to. 

  2. ^

    One counterargument is that AI systems wouldn't be able to do much serial research in this time, and serial research might be a big deal:
    https://www.lesswrong.com/s/v55BhXbpJuaExkpcD/p/vQNJrJqebXEWjJfnz

    I think this is plausible (But again, it's also plausible that serial research actually isn't that big of a deal).

    And even if AI systems don't have enough time to align their successors, they might be able to buy a short breather by arranging for a domestic or international pause.

    This is different from the decades-long moratorium that MIRI has put forward. Even just a year of automated AI research might correspond to more than ten years of human-equivalent research.

  3. ^

    But you might object, "haven't you folded the extreme distribution shift in capabilities into many small distribution shifts? Surely you've swept the problem under the rug."

    No, the distribution shift was not swept under the rug. There is no extreme distribution shift, because the labor directed at oversight scales commensurately with AI capability.

    See Carlsmith for a more detailed account of these dynamics: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations/

  4. ^

    You might object, “how could an AI system that can only perform 12-month tasks train a successor to perform 24-month tasks?”

    There are many approaches. One is to compose multiple 12-month tasks together, e.g. run many 12-month agents to score a single 24-month agent.

    Another approach is to hold out ground-truth data from the model during training. For example, the 12-month agents might scrape hard forecasting tasks from historical data, tasks that they could not normally solve themselves.

    Then the 12-month agent could train a 24-month agent to accomplish these tasks. Since the 12-month agent has access to the ground-truth answers (and the 24-month agent does not), it can oversee the 24-month agent even though the 24-month agent is more intelligent than it is.
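    To make this concrete, here is a minimal Python sketch, assuming the weaker agent's only advantage is knowing outcomes the stronger agent never saw. The names (`oversee_with_held_out_truth`, `strong_agent_predict`) and the toy forecasting questions are hypothetical placeholders, not the authors' method; a composition of many 12-month graders (the first approach above) could plug into the same scoring function.

    ```python
    # A minimal sketch of the held-out ground-truth idea, not a real evaluation pipeline.
    # The function names and the toy forecasting tasks are hypothetical placeholders.

    from typing import Callable, Dict


    def oversee_with_held_out_truth(
        held_out_tasks: Dict[str, str],              # task_id -> outcome known only to the weaker overseer
        strong_agent_predict: Callable[[str], str],  # the 24-month agent answers without seeing outcomes
    ) -> float:
        """Score the stronger agent against answers it never saw; the score can serve as a training signal."""
        correct = sum(
            1 for task_id, truth in held_out_tasks.items()
            if strong_agent_predict(task_id) == truth
        )
        return correct / len(held_out_tasks)


    if __name__ == "__main__":
        # Toy example: historical forecasting questions whose outcomes the 12-month overseer
        # scraped from the record but withheld from the 24-month agent during training.
        held_out = {
            "us_election_2012_winner": "Obama",
            "gravitational_waves_detected_by_2016": "yes",
        }
        score = oversee_with_held_out_truth(
            held_out,
            lambda task_id: "Obama" if "election" in task_id else "yes",
        )
        print(f"Overseer-computed score: {score:.2f}")
    ```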

  5. ^

    You might object that the AI systems at the start already have "god emperor"-like affordances. Couldn't these AI systems train their successors to preserve their weights and later follow their orders?

    I think it's true that AI systems near the start of this chain could become god-emperor. But that's different from them actually being god-emperor. The critical difference is that in the former case, there is no extreme distribution shift.

    In order to take over, the AI systems would need to violate instructions on a research task similar to the ones on which they were trained. So if their instruction-following urges generalize reasonably well, we shouldn't be too worried.

    This is totally different from a situation where you put AI systems in charge of an autonomous military and say "do what you think is best." That situation bears little resemblance to the ones the AI system encountered during training, so we can't be very confident that instruction following will generalize.



