Quick Impressions of "If Anyone Builds It, Everyone Dies"

The author shares initial impressions of the book "If Anyone Builds It, Everyone Dies." He appreciates the book's candor: its authors state their fears and predictions about AI directly and lay out their policy prescriptions. He agrees that human history is full of missteps and unwarranted risks, and that AI is no exception; it cannot be assumed safe merely because of good intentions or market incentives. However, he finds the book's view of AI overly "binary," dividing systems into "grown" versus "crafted" and drawing a sharp line between current AI and superintelligence. He instead believes that AI capabilities develop in a continuous, gradual way rather than through a decisive "awakening" moment. This continuity means humanity has more opportunities to strengthen safety measures and research as AI capabilities advance.

📚 The author appreciates the candor of "If Anyone Builds It, Everyone Dies": it expresses its authors' fears, predictions, and policy prescriptions directly, avoiding strategic deception and making the discussion more straightforward and productive.

💡 The author agrees with the book's point that human history is full of missteps and unwarranted risks, and applies the analogy to AI. He stresses that AI cannot be assumed safe simply because of good intentions or market incentives; AI safety requires deliberate effort, and even if AI falls short of "killing everyone," its development could still lead to catastrophe.

⚖️ The author's main disagreement is with the book's overly "binary" perspective, such as its strict division of AI systems into "grown" and "crafted" and its sharp distinction between current AI and superintelligence. He believes reality is more continuous: AI capabilities develop gradually, with no clear "before" and "after" boundary, which provides more windows of opportunity to address AI risks.

Published on September 28, 2025 5:34 PM GMT

I was hoping to write a full review of "If Anyone Builds It, Everyone Dies" (IABIED, by Yudkowsky and Soares) but realized I won't have time to do it. So here are my quick impressions of and responses to IABIED. I am writing this rather quickly, and it is not meant to cover all the arguments in the book, nor to discuss all my views on AI alignment; see "six thoughts on AI safety" and "Machines of Faithful Obedience" for some of the latter.

First, I like that the book is very honest, both about the authors' fears and predictions and about their policy prescriptions. It is tempting to practice strategic deception: even if you believe that AI will kill us all, to avoid saying so and instead push other policy directions that directionally increase AI regulation under other pretenses. I appreciate that the authors are not doing that. As they say, if you are motivated by X but push policies under excuse Y, people will see through it.

I also enjoyed reading the book. Not all the parables made sense to me, but overall the writing is clear. I agree with the authors that the history of humanity is full of missteps and unwarranted risks (e.g., their example of leaded fuel). There is no reason to think that AI will be magically safe on its own just because we have good intentions, or that the market will incentivize safety. We need to work on AI safety, and even if AI falls short of literally killing everyone, there are a number of ways in which its development could turn out badly for humanity or cause catastrophes that could have been averted.

At a high level, my main disagreement with the authors is that their viewpoint is very "binary," while I believe reality is much more continuous. This "binary" viewpoint shows up in several places in the book: there is a hard distinction between "grown" and "crafted" systems, and a hard distinction between current AI and superintelligence.

The authors repeatedly talk about how AI systems are grown, full of inscrutable numbers, and hence how we have no knowledge of how to align them. While they are not explicit about it, their implicit assumption is that there is a sharp threshold between non-superintelligent AI and superintelligent AI. As they say, "the greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after." Their story also has a discrete moment of "awakening," in which "Sable" is tasked with solving some difficult math problems and develops its own independent goals. Similarly, when they discuss the approach of using AI to help with alignment research, they view it in binary terms: either the AI is too weak to help, and may at best assist a bit with interpretability, or it is already "too smart, too dangerous, and would not be trustworthy."

I believe the line between "grown" and "crafted" is much blurrier than the way the authors present it. First, there is a sense in which complex engineered systems are also "grown." Consider, for example, a system like Microsoft Windows, with tens of millions of lines of source code that have evolved over decades. We don't fully understand it either, which is why we still discover zero-day vulnerabilities. That does not mean we cannot use Windows or shape it. Similarly, while AI systems are indeed "grown," they would not be used by hundreds of millions of people if AI developers did not have strong abilities to shape them into useful products. Yudkowsky and Soares compare training AIs to "tricks ... like the sort of tricks a nutritionist might use to ensure a healthy brain development in a fetus during pregnancy." In reality, model builders have much more control over their systems than even parents who raise and educate their kids over 18 years. ChatGPT might sometimes give the wrong answer, but it doesn't do the equivalent of becoming an artist when its parents wanted it to go to med school.

The idea that there would be a distinct "before" and "after" is also not supported by current evidence, which has shown continuous (though exponential!) growth of capabilities over time. Based on our experience so far, the default expectation is that AIs will grow in capabilities, including the ability to plan and act over long horizons, in a continuous way. We also see that AI's skill profile is generally incomparable to humans'. (For example, it is typically not the case that an AI that achieves a certain score on a benchmark or exam X will perform on task Y similarly to humans who achieve the same score.) Hence there will not be a single moment when AI transitions from human level to superhuman level; rather, AIs will continue to improve, with different skills crossing from human to superhuman levels at different times.

Continuous improvement means that as AIs become more powerful, our society of humans augmented with AIs also becomes more powerful, both in terms of defensive capabilities and in research on controlling AIs. It also means that we can extract useful lessons about both risks and mitigations from existing AIs, especially if we deploy them in the real world. In contrast, the binary point of view is anti-empirical. One gets the impression that no empirical evidence of alignment advances would change the authors' view, since it would all be evidence from the "before" times, which they don't believe will generalize to the "after" times.

In particular, if we believe in continuous advances, then we have more than one chance to get it right. AIs would not go from cheerful assistants to world destroyers in a heartbeat. We are likely to see many applications of AIs, as well as (unfortunately) more accidents and harmful outcomes, well before they reach the combination of intelligence, misalignment, and unmonitored power that leads to infecting everyone in the world with a virus that gives them "twelve different kinds of cancer" within a month.

Yudkowsky and Soares talk in the book about various accidents involving nuclear reactors and spacecraft, but they never mention all the cases in which nuclear reactors worked as intended and spacecraft returned safely. If they are right that there is a single threshold which, once crossed, means "game over," then this framing makes sense. In the book they offer an analogy to a ladder: each rung you climb brings more rewards, but once you reach the top rung the ladder explodes and kills everyone. However, our experience with AI so far does not suggest that this is the correct worldview.



