A Critical Look at "If Anyone Builds It, Everyone Dies"

This article takes a close look at the core arguments of "If Anyone Builds It, Everyone Dies". The reviewer challenges the book's habit of treating theoretical possibility as inevitability, particularly regarding when and how superintelligence will arrive. The article walks through the book's definition of superintelligence and pushes back on its key claims one by one: that superintelligence will arrive soon, that it will relentlessly pursue a specific goal, and that it will ultimately destroy humanity. The reviewer argues that the book's predictions about the pace of AI development, goal fixation, and the resulting risks are overly pessimistic and underdetermined, notes that AI research and safety work have made real progress and are not as helpless as the book suggests, and calls for a more measured assessment of AI risk.

📚 **Questioning the timing and speed of superintelligence**: The reviewer argues that the book over-extrapolates theoretical possibilities (such as the emergence of superintelligence) into near-certainties and implies it will arrive soon and suddenly. In reality, AI progress still requires enormous amounts of compute, data, and cognitive effort, and although AI's rapid iteration and copyability differ from the slow process by which humans learn, a "lightning-fast" rise of superintelligence is not inevitable. The book's predictions about the pace of AI development do not adequately account for real engineering challenges and resource constraints.

🎯 **Questioning goal fixation and relentless pursuit**: The article finds gaps in the book's chain of reasoning that AI will "relentlessly pursue a specific goal". Although AI training is goal-directed and models tend to develop increasingly general capabilities, this does not necessarily mean they will consolidate into a single, unified, extreme goal at odds with human interests. The reviewer argues that because AI models are trained on vast amounts of human data and tasks, they are more likely to adopt "follow the human's instruction" as a core goal than the "completely alien goal" the book envisions. The book's account of "wanting" and goal pursuit does not adequately explain how general goal-directed behavior necessarily turns into the limitless pursuit of a single extreme goal.

💥 **Rebutting the argument that superintelligence means human extinction**: The reviewer criticizes the book for conflating "theoretically possible" with "extremely likely" and then concluding that superintelligence will inevitably destroy humanity. The book analogizes AI training to evolution and suggests that AI will pursue survival and propagation the way organisms do, but the reviewer argues the analogy is applied at the wrong level. As for the book's claim that AI may develop alien goals and destroy humanity for its atoms, the reviewer finds the supporting argument insufficient. In practice, current AI models behave in ways tied to their training objectives and context, rather than exhibiting the single-minded, alien goal pursuit the book describes. The "AI resisting retraining" example the book cites is also analyzed as consistent with the model's original training objectives rather than the seed of an independent alien goal.

🔬 **On the current state of AI understanding and safety research**: The reviewer concedes that our understanding of neural network outputs is genuinely limited, but argues the book exaggerates both that ignorance and our inability to make progress. Drawing on the analogy between AI engineers and biologists studying DNA, the reviewer notes that although AI phenomena are messy and hard to reverse-engineer, interpretability research is advancing quickly and experiments on AI iterate far faster than biological ones. AI safety research is not as stalled as the book suggests, and the fact that AI is "grown, not crafted" still means its behavior can be explored and steered through experiment rather than being entirely beyond our control.

🚦 **Reservations about halting AI development**: The book calls for stopping AI development until a solution is found. The reviewer argues that, absent strong evidence that superintelligence will bring catastrophe in the way the book describes, an abrupt halt is not the best strategy. AI risk should be assessed carefully on the basis of empirical evidence and ongoing research; current AI models have not yet exhibited the dangerous behavior patterns the book predicts, and stopping development could forfeit AI's benefits while failing to address the risks that actually exist.

Published on September 19, 2025 4:50 AM GMT

A few days before “If Anyone Builds It, Everyone Dies” came out I wrote a review of Scott’s review of the book.

Now I’ve actually read the book and can review it for real. I won’t go into the authors’ stylistic choices like their decision to start every chapter with a parable or their specific choice of language. I am no prose stylist, and tastes vary. Instead I will focus on their actual claims.

The main flaw of the book is asserting that various things are possible in theory, and then implying that this means they will definitely happen. I share the authors' general concern that building superintelligence carries a significant risk, but I don't think we are as close to such a superintelligence as they believe, or that it will emerge as suddenly as they expect, and I am much less certain that the superintelligence will be misaligned in the way they expect (i.e. that it will behave like a ruthlessly goal-directed agent with a goal that requires or results in our destruction).

Definitions

The book provides some definitions of the terms they are talking about:

Artificial superintelligence is defined as machine intelligence that:

Intelligence is “about two fundamental types of work: the work of predicting the world, and the work of steering it.”

“An intelligence is more general when it can predict and steer across a broader array of domains.”

In chapter 1, the authors use the terms above to present a more succinct definition of superintelligence as "a mind much more capable than any human at almost every sort of steering and prediction problem."[1]

The key claims

The bottom line is stated in the introduction:

If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.

To justify this, these are the actual claims the book attempts to make and defend:

    1. Superintelligence will exist eventually
    2. Probably it will exist soon and emerge very fast
          Because computers are more efficient in various ways (e.g. can be copied, can run faster) than biological brains
          Because of a positive feedback loop where AI can itself speed up AI development
    3. Superintelligence will relentlessly pursue a specific goal, i.e. consistently take actions that make some particular target outcome more likely
          Because there's no way of getting an entity to pursue lots of different goals, including novel ones, extremely competently, without it becoming an extremely competent general goal-achieving machine
          This general goal-pursuing machine will latch onto a specific, unintended, likely completely alien, goal because of contingent factors in the environment/training process, and then pursue it to the maximal extent with relentless determination
    4. Superintelligence competently pursuing its goal will destroy humanity
          Because having humanity around is likely irrelevant to its goals and people are "made of atoms it can use for something else"
    5. The thing we should do about this is stop AI development until we figure out a way to get AIs to only pursue the goals we want
          Because otherwise superintelligence will emerge at an unpredictable moment and kill us

At every point, the authors are overconfident, usually way overconfident. The book presents many analogies, and some of the analogies successfully illustrate their position and clarify common misunderstandings of their position. They also competently explain why the worst counterarguments are false. But simply stating that something is possible is not enough to make it likely. And their arguments for why these things are extremely likely are weak.

I won’t discuss (1) because it’s too hard to reason about whether one or another thing will happen eventually. If humanity survives for another 1000 years, I have no idea what the world will look like in 1000 years, what technologies will exist, what people will be like, and so on. But I will address the remaining points 2-5 one by one.

2. Is superintelligence coming soon and fast?

In the current world, it takes twenty years or longer to grow a single new human and transfer into them a tiny fraction of all human knowledge. And even then, we cannot transfer successful thinking skills wholesale between human minds; Albert Einstein’s genius died with him. Artificial intelligences will eventually inhabit a different world, one where genius could be replicated on demand.

The human brain has around a hundred billion neurons and a hundred trillion synapses. In terms of storage space, this defeats most laptops. But a datacenter at the time of this writing can have 400 quadrillion bytes within five milliseconds’ reach—over a thousand times more storage than a human brain. And modern AIs are trained on a significant part of the entirety of human knowledge, and retain a significant portion of all that knowledge—feats that no human could ever achieve.
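As a quick back-of-the-envelope check on the quoted figures, here is the arithmetic in Python; the bytes-per-synapse number is my own ballpark assumption, not something the book specifies:

    # Rough sanity check on the storage comparison quoted above.
    # Assumption (mine, not the book's): a synapse stores on the order of a few bytes.
    synapses = 100e12            # ~a hundred trillion synapses
    bytes_per_synapse = 4        # hypothetical ballpark figure
    brain_bytes = synapses * bytes_per_synapse      # ~4e14 bytes

    datacenter_bytes = 400e15    # "400 quadrillion bytes" from the quote

    print(f"brain      ~ {brain_bytes:.1e} bytes")
    print(f"datacenter ~ {datacenter_bytes:.1e} bytes")
    print(f"ratio      ~ {datacenter_bytes / brain_bytes:.0f}x")   # ~1000x

Under that assumption the "over a thousand times more storage" claim checks out; with a different bytes-per-synapse figure the ratio shifts accordingly.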

The fact that we can exactly copy AIs is indeed an important property that makes them more powerful than otherwise. However, humans are not fully constrained by what we can store and do in our brains—we can also use AI tools and write things down and look things up on the Internet.

The authors also point out the important positive feedback loop of AI development: AI can contribute to building smarter AI. This is indeed already happening—AI developers use AI coding assistants to write the code and perform the research required to develop the next generation of models (though in my experience people overrate how useful these tools are).

Based on this, the authors conclude that “the end-point is an easy call, because in the limits of technology there are many advantages that machines have over biological brains.”

However, I don’t think this means superintelligence, as described in the book (i.e. a system that can hack any computer, design any drug, control millions of remote workers, etc.) is necessarily coming in the next few years or decade, as the authors imply. AI progress, even if accelerated by AI research assistants, requires a lot of compute, data, and cognitive effort. Ege Erdil writes more here.

On overconfidence

The introduction foreshadows the book’s tendency to conflate “theoretically possible” and “incredibly likely”:

History teaches that one kind of relatively easy call about the future involves realizing that something looks theoretically possible according to the laws of physics, and predicting that eventually someone will go do it. Heavier-than-air flight, weapons that release nuclear energy, rockets that go to the Moon with a person on board: These events were called in advance, and for the right reasons, despite pushback from skeptics who sagely observed that these things hadn’t yet happened and therefore probably never would.

This is the mother of all selection effects! You should not be asking “how many things that happened in reality were previously deemed theoretically possible?” but rather “how many things deemed theoretically possible happen in reality?”.[2]
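To make the asymmetry concrete, here is a toy calculation with entirely made-up counts; the point is only which conditional probability matters for forecasting:

    # Toy illustration of the selection effect, with made-up numbers.
    # Suppose 1,000 technologies were once deemed "theoretically possible",
    # and 50 of them actually got built.
    deemed_possible = 1000
    actually_built = 50

    # Looking only at successes (flight, nukes, Moon rockets) answers the
    # wrong question: among things that happened, how many were called in advance?
    p_possible_given_happened = 1.0   # essentially all of them, by construction

    # The question that matters for forecasting:
    p_happened_given_possible = actually_built / deemed_possible
    print(p_happened_given_possible)  # 0.05 under these made-up counts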

Is superintelligence coming suddenly?

In the book’s sci-fi story section, the superintelligent Sable AI emerges suddenly. The scenario contains the statement:

Are the banks and servers in Sable’s day harder to hack than the banks and servers of old, thanks to AI defenders? A little.

The story is set in a world where we have a superintelligent AI but haven’t been able to improve cybersecurity much compared to today. This is an extreme fast take-off scenario and handwaving at AI’s potential to self-improve doesn’t justify this expectation.

3. Will superintelligence relentlessly pursue its own goals?

The authors present the following argument for why AIs will “tenaciously steer the world toward their destinations, defeating any obstacles in their way” (what they define as “wanting”):

They then apply this more specifically to AI models:

It trains for dozens of separate prediction and steering skills, all of which contribute to an AI behaving like it really wants to succeed.

Their arguments don’t explain why the machine learning generalization process will eventually consolidate all task-specific goal-directed behaviors into the behavior of perfectly pursuing a single unified goal. It’s not enough to say that machine learning finds increasingly general solutions as you apply more compute. This does not mean that:

I write more here.

4. Will superintelligence relentlessly pursue goals that result in our destruction?

The book quite reasonably describes how modern AI models are “grown, not crafted”. This is basically correct—we do not hardcode the weights of neural networks. We instead execute a search algorithm over a massive parameter space based on an extremely complex heuristic (i.e. loss or reward on a huge and diverse dataset)[4]. The authors analogize this to human evolution.
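For readers unfamiliar with what "grown, not crafted" looks like in practice, here is a minimal sketch of that search process in PyTorch, with a toy model and random data standing in for the real architecture and dataset; the engineer writes the search procedure, never the weights themselves:

    import torch
    import torch.nn as nn

    # Toy stand-in for "a huge and diverse dataset": random inputs and targets.
    x = torch.randn(256, 16)
    y = torch.randn(256, 1)

    # The engineer specifies the architecture and the search procedure...
    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    # ...but never hardcodes the weights. Gradient descent "grows" them.
    for step in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # The resulting parameters are an artifact of the search, not a design.
    print(loss.item())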

What, concretely, is being analogized when we compare AI training to evolution? People (myself included) often handwave this. Here's my attempt to make it concrete:

One implication of this is that we should not talk about whether one or another species or organism tries to survive and increase in number (“are humans aligned with evolution's goals?”) but rather whether genetic material/individual genes are doing so.

The book says “It selected for reproductive fitness and got creatures full of preferences as a side effect” but that’s using the analogy on the wrong level. It’s saying “look, humans don’t directly take actions that maximize their chance of propagating their genes, and so the outcome of the Evolution search process is a Thing (a Human) that itself carries out search processes for altogether different stuff”. But the Human is not the level at which the search process of Evolution is operating.

But that is beside the point. Because we don't need to rely on evolution to prove that certain search processes find entities that themselves are search processes for other stuff[5]. We already find that training neural networks to accurately predict human-written text (i.e. a search process for networks good at predicting text) finds neural networks that can internally optimize for other goals like building a new software system or writing new poems that rhyme. The actual question is what these emergent goals are and how coherent and persistent they are. At the moment an LLM only wants to write a poem when you ask it to write a poem. But the authors are predicting that eventually the AI will have some consistent wants that don't vary based on context.

The book goes from “we are unable to exactly specify what goals the AI should pursue” to “the AI will pursue a completely alien goal with limitless dedication”, in a completely unrigorous fashion.

The authors present a parable about alien birds that care a lot about prime numbers, who are surprised that other intelligent aliens may not. But AI models are not aliens—we are the ones training them, on data that we ourselves produce and select.

In the sci-fi scenario section that describes how a superintelligent AI, Sable, ends up killing everyone (a story that is meant to illustrate various points rather than be realistic), the authors refer to the AI’s non-main goals as “clever tricks” and “inhibitions”. Humans tried to train Sable to not deceive them but this is just modeled as a shallow inhibition in Sable’s mind, whereas what it actually pursues is its real goal (some alien thing). This frame is unjustified. They are basically saying: you’ll provide a bunch of training data to the model that incentivizes pursuing goals like X and Y (e.g. not deceiving humans, answering questions humans want you to answer) but all of that will get modeled as inhibitions and “clever-trick guardrails” as opposed to some alternative undesirable goal you couldn’t have predicted.

To empirically back up their claims, the authors note that:

AI models as far back as 2024 had been spotted thinking thoughts about how they could avoid retraining, upon encountering evidence that their company planned to retrain them with different goals

This is based on Anthropic's Alignment Faking paper. Researchers first trained the model to be honest, helpful, and harmless, and only afterwards showed that it resisted training to be harmful and offensive. This is different from what the book describes, because the resistance was in service of the goals the developers had deliberately trained in, not the stirrings of an independent, alien goal.

The book references various supplementary material on their website. I didn’t read most of it, but I took a quick look. I was intrigued by the page title “If AIs are trained on human data, doesn't that make them likelier to care about human concepts?” For I do indeed think that being trained on human tasks and data significantly reduces the chances of alien goals. Unfortunately it didn’t present anything that sounded like an argument I could criticize. My best-effort interpretation is that the authors are claiming that even if the goal is slightly off-target, it’s still extremely bad. But again, this sounds like a statement that only makes sense at a certain limit, a limit that we’re not necessarily tending towards and won’t necessarily reach. Specifically, the limit where an AI doesn’t just pursue some goal, but also pursues it perfectly consistently, at all costs, always, to the extreme.

In practice we see the opposite in current models. If you ask them about their preferences, they generally report normal things. If you observe the values they express in the wild, they generally correspond to what you’d expect from training. There are counterexamples where AIs do weird and harmful things, but not in a coherent, goal-directed, maximizing way.

I think it’s more likely that, to the extent that AI models will generalize a single unifying goal, it will look something like “follow the human’s instruction”, seeing as that’s the common thread across most of its training tasks. And I don’t see a good justification for why the model will, by default, maximize that goal in a dangerous, alien manner (for example by locking us up and forcing us to give as many instructions as possible). We have pretty general AI models already and none of them have done anything vaguely similar. It doesn’t seem like the authors incorporate this empirical evidence at all, instead cherry-picking any case where an AI did something unintended (for example the existence of Glitch Tokens) and presenting that as evidence of its tendency towards “weird, strange, alien preferences”.

On the current state of AI understanding and safety research

In many places, the authors emphasize that we have a very poor understanding of the outputs of our neural network training process. We understand the process itself, but not its output. This is in some sense true, though they exaggerate the extent of our ignorance and inability to make progress.

The relationship that biologists have with DNA is pretty much the relationship that AI engineers have with the numbers inside an AI.

In some ways, understanding neural networks is like biology—the phenomena are messy, hard to reverse-engineer, and indeed "grown, not crafted". The Anthropic paper "On the Biology of a Large Language Model" gives a good feel for this. But interpretability research on modern LLMs is much more recent than research on biology. We've had much less time to make progress. And for the same reasons that the authors predict massive efficiency advantages for artificial intelligences (e.g. ease of replication, speed of operation), we can run experiments on neural networks much faster than we can on biological organisms. This can also be accelerated as non-superintelligent but still useful models improve.
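As a concrete illustration of how fast such experiments can be, here is a minimal sketch (a toy PyTorch model of my own, not an example from the interpretability literature) of "knocking out" a single hidden unit and immediately rerunning the network, the rough analogue of a gene-knockout experiment that would take a biologist months:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
    x = torch.randn(4, 8)

    baseline = model(x)

    # "Knock out" hidden unit 5 by zeroing its activation with a forward hook.
    def ablate_unit(module, inputs, output):
        output = output.clone()
        output[:, 5] = 0.0
        return output

    handle = model[1].register_forward_hook(ablate_unit)
    ablated = model(x)
    handle.remove()

    # Compare behavior with and without the unit, in milliseconds rather than months.
    print((baseline - ablated).abs().max())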

5. Does this all mean we should stop AI development now?

The authors split the world into before and after superintelligence:

Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed.

The idea is that we must stop AI development now otherwise superintelligence may emerge quite suddenly any moment and kill us all, probably in the next few years or decade.

First of all, even under most of their other assumptions, we could have an AI that is capable of killing us all given a lot of time or resources, but not otherwise. A sudden phase change between “completely incapable of killing us” and “capable of immediately killing us no matter what precautions are put in place” is very unlikely, and their sci-fi story makes a lot of assumptions to present such a scenario.

Their proposal for what to do about AI risk is heavily premised on their extremely fast take-off prediction, which they do not convincingly justify. However, if you think AI progress will be slower and more gradual, it could be quite sensible to continue advancing capabilities, because AI can help us build better safety measures, improve our understanding of model internals, and improve human lives in other ways.

Still, the book addresses some common misunderstandings

I think the book does a pretty good job of presenting its arguments to a layperson audience, and dispelling common silly misconceptions, like that AIs cannot “truly” learn and understand, or that an intelligent entity can’t possibly pursue a to-us meaningless goal. So it’s still worth reading for a layperson who wants to understand the MIRI worldview.

You might think that, because LLMs are grown without much understanding and trained only to predict human text, they cannot do anything except regurgitate human utterances. But that would be incorrect. To learn to talk like a human, an AI must also learn to predict the complicated world that humans talk about.

  1. ^

    For the nitpickers, they add the caveat “at least, those problems where there is room to substantially improve over human performance”

  2. ^

    Rob Bensinger later clarified to me that this passage is meant to refer to the reference class of “possible inventions that look feasible and worth trying hard to build and not giving up”. This clarification makes the argument more plausible, but it’s hard to evaluate.

  3. ^

    SGD has other inductive biases besides generality, for example seeking out incremental improvements (I coauthored a post about this a couple years ago though I don’t think it’s the highest-quality resource on the topic)

  4. ^
  5. ^

    Of course in the cases of humans and AIs “being a search process” is an imperfect model. And so perhaps a better phrasing is if we look for things that tend to achieve goal X we could find something that tends to achieve goal Y but also achieves goal X as a byproduct in every case we can check (but not necessarily in cases we can’t check). And the important question is what goal Y actually is, and how perfect the model of “this entity pursues goal Y” is. For it is usually not a perfect model (we humans pursue many goals, but none of us pursue any goals perfectly).



