"If Anyone Builds It, Everyone Dies" by Eliezer Yudkowsky and Nate Soares (hereafterreferred to as "Everyone Builds It" or "IABIED" because I resent Nate's gambitto get me to repeat the title thesis) is an interesting book. One reason it'sinteresting is timing: It's fairly obvious at this point that we're in an alignmentwinter. The winter seems roughly caused by:
The second election of Donald Trump removing Anthropic's lobby from the White House. Notably this is not a coincidence but a direct result of efforts from political rivals to unseat that lobby. When the vice president of the United States is crashing AI safety summits to say that "I'm not here this morning to talk about AI safety, which was the title of the conference a couple of years ago. I'm here to talk about AI opportunity" and that "we'll make every effort to encourage pro-growth AI policies", it's pretty obvious that technical work on "safety" and "alignment" is going to be deprioritized by the most powerful Western institutions, and that people will change their research directions as a result.
Key figures in AI alignment from the MIRI cluster (especially Yudkowsky) overinvesting in concepts like deceptive mesaoptimizers and recipes for ruin to create almost unfalsifiable, obscurantist shoggoth-of-the-gaps arguments against neural gradient methods. At the same time the convergent representation hypothesis has continued to gain evidence and academic ground. These thinkers gambled everything on a vast space of minds that doesn't actually exist in practice and lost.
The value loading problem outlined in Bostrom 2014 of getting a general AI system to internalize and act on "human values" before it is superintelligent and therefore incorrigible has basically been solved. This achievement also basically always goes unrecognized because people would rather hem and haw about jailbreaks and LLM jank than recognize that we now have a reasonable strategy for getting a good representation of the previously ineffable human value judgment into a machine and having the machine take actions or render judgments according to that representation. At the same time people generally subconsciously internalize things well before they're capable of articulating them, and lots of people have subconsciously internalized that alignment is mostly solved and turned their attention elsewhere.
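For concreteness, the kind of strategy I have in mind is ordinary RLHF-style preference modeling: train a reward model on pairwise human judgments and use it to score or steer the policy. A minimal sketch of that recipe follows; the names, dimensions, and random embeddings are my own placeholders, not anything from the book or any particular lab's pipeline.

```python
# Minimal sketch of a Bradley-Terry reward model trained on pairwise human
# comparisons, the standard way a "representation of human value judgment"
# gets into a machine today. Illustrative placeholders throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores an embedded (prompt, response) pair with a scalar approval value."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: the human-preferred completion should score higher."""
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy usage with random vectors standing in for encoded completions.
model = RewardModel()
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model, chosen, rejected)
loss.backward()
```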
I think this last bullet point is particularly unfortunate because solving the Bostrom 2014 value loading problem, that is to say getting something functionally equivalent to a human perspective inside the machine and using it to constrain a superintelligent planner, is not a solution to AI alignment. It is not a solution for the simple reason that a general reward model needs to be competent enough in the domains it's evaluating to know if a plan is good or merely looks good, if an outcome is good or merely looks good, etc. Nearly by definition a merely human perspective is not competent to evaluate the plans or outcomes of plans from a superintelligent planner that will otherwise walk straight into extremal Goodhart outcomes. Therefore you need not just a human value model but a superintelligent human value model, which must necessarily be trained by some kind of self-improving synthetic data or RL loop starting from the human model, which requires us to have a coherent method for generalizing human values out of distribution. This is challenging because humans do not natively generalize their values out of distribution, so we don't necessarily know how to do this or even if it's possible to do. The problem is compounded by the fact that if your system drifts away from physics the logical structure of the universe will push it back, but if your system drifts away from human values it stays broken.
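To make the Goodhart dynamic concrete, here is a toy numerical illustration (mine, not the book's; the functional forms and numbers are arbitrary) of a merely human-level proxy evaluator being optimized against harder and harder. The proxy tracks the true objective on typical plans but misses an extremal penalty, so the harder the planner searches, the worse the plan it actually selects.

```python
# Toy extremal Goodhart illustration: more search against an imperfect proxy
# selects more extreme plans whose true value is lower.
import numpy as np

rng = np.random.default_rng(0)

def true_value(plan: np.ndarray) -> float:
    # What we actually care about: rewards ambition but collapses for extreme plans.
    a = np.linalg.norm(plan)
    return float(a - 0.05 * a ** 3)

def proxy_value(plan: np.ndarray) -> float:
    # Merely human-level evaluator: accurate in-distribution, blind to the extremal penalty.
    return float(np.linalg.norm(plan) + rng.normal(scale=0.1))

for n_candidates in (10, 100, 10_000):
    plans = rng.normal(size=(n_candidates, 8))   # candidate plans
    best_by_proxy = max(plans, key=proxy_value)  # argmax under the proxy
    print(n_candidates, round(true_value(best_by_proxy), 2))
# As the candidate pool grows, the proxy keeps endorsing ever more extreme
# plans, and the true value of the selected plan falls.
```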
Everyone Builds It is not a good book for its stated purposes, but it grew on me by the end. I was expecting to start with the bad and then write about the remainder that is good, but instead I'll point out this book is actually a substantial advance for Yudkowsky in that it drops almost all of the rhetoric in bullet two which contributed to the alignment winter. This is praiseworthy and I'd like to articulate some of the specific editorial decisions which contribute to this.
The word "paperclip" does not appear anywhere in the book. Instead Yudkowskyand Soares point out that the opaque implementation details of the neural net meanthat in the limit it generalizes to having a set of "favorite things" it wants tofill the world with which are probably not "baseline humans as they exist now".This is a major improvement over the obstinate insistence that these models willwant a "meaningless squiggle" by default and brings the rhetoric more in linewith e.g. Scott Alexander's AI 2027 scenario.
The word "mesaoptimizer" does not appear anywhere in the book. Insteadit focuses on the point that building a superintelligent AI agent means creatingsomething undergoing various levels of self modification (even just RL weight updates)and predicting the preferences of the thing you get at the end of that processis hard, possibly even impossible in principle. Implicitly the book argues that"caring about humans" is a narrow target and hitting it as opposed to other targetslike "thing that makes positive fun yappy conversation" is hard. That is assumingyou get something like what you train for and doesn't take into account what thebook calls complications. For example it cites the SolidGoldMagikarp incidentas an example of a complication which could completely distort the utility function(another phrase which does not appear in the book) of your superintelligent AIagent. There's precedent for this also in the case of the spiritual bliss attractorstatedescribed in the Claude 4 system card, where instances of Claude talking to eachother wind up in a low entropy sort of mutual Buddhist prayer.
In general the book moves somewhat away from abstraction and comments more on the empirical strangeness of AI. This gives it a slight Janusian flavor in places, with emphasis on phenomena like glitch tokens, Truth Terminal, and mentions of "AI cults" that I assume are based on some interpolation of things like Janus's Discord server and ChatGPT spiralism cases. If anything the problem is that it doesn't do enough of this; notably absent is reference to work from organizations like METR (if I was writing the book their AI agent task length study would be a necessary inclusion). Though I should note that there's a lag in publishing and it's possible (but unlikely) that Yudkowsky and Soares simply didn't feel there was any relevant research to cite while doing the bulk of the writing. Specific named critics are never mentioned or responded to; the text exists in a kind of solipsistic void that contributes to the feeling of green ink or GPT base model output in places. A feeling that notably exists even when it's saying true things. In general most of my problem with the book is not disagreements with particular statements so much as with its framing and overall thesis.
This is good and brings Yudkowsky & Soares much closer to my threat model.
All of this is undermined by truly appalling editorial choices. Foremost of these is the choice to start each chapter with a fictional parable, leading to chapters with opening sentences like "The picture we have painted is not real." The parables are weird and often condescending, and the prose isn't much better. I found the first three chapters especially egregious, with the peak being chapter three, where the book devotes an entire chapter to advocating for a behaviorist definition of want. This is not how you structure an argument about something you think is urgent, and the book comes off as having a sort of aloof tone that is discordant with its message. This is heightened if you listen to the audiobook version, which has a narrator who is not the Kurzgesagt narrator but I think is meant to sound like him, since Kurzgesagt has done some Effective Altruism videos that people liked. The gentle faux-intellectual British narrator reinforces the sense of passive observation in a book that is ostensibly supposed to be about urgent action. Bluntly: A real urgent threat that demands attention does not begin with "once upon a time". This is technically just a 'style' issue, but the entire point of writing a popular book like this is the style, so it's a valid target of criticism, and I concur with Shakeel Hashim and Stephen Marche at The New York Times that it's very bad.
One oddity that stands out is Yudkowsky and Soares' ongoing contempt for large language models and hypothetical agents based on them. Again, for a book which is explicitly premised on the idea that urgent action is necessary because AI might become superintelligent in just a few years, it is bizarre that the authors don't feel comfortable making more reference to the particulars of the existing AI systems which hypothetical near-future agents would be based on. I get the impression that this is meant to help future proof the book, but it gives the sentences a kind of weird abstraction in places where they don't need it. We're still talking about "the AI" or "AI" as a kind of tabula-rasa technology. Yudkowsky and Soares state explicitly in the introduction that current LLM systems "still feel shallow" to them. Combined with the parable introductions the book feels like fiction even when it's discussing very real things.
I am in the strange position of disagreeing with the thesis but agreeing with most individual statements in the book. Explaining my disagreement would take a lot of words that would take a long time to write and that most of you don't want to read in a book review. So instead I'll focus on a point made in the book which I emphatically agree with: That current AI lab leadership statements on AI alignment are embarrassing and show that they have no idea what they are doing. In addition to the embarrassing statements they catalog from OpenAI's Sam Altman, xAI's Elon Musk, and Facebook's Yann LeCun, I would add DeepMind's Shane Legg and Demis Hassabis being unable to answer straightforward questions about deceptive alignment on a podcast. Even if alignment is relatively easy compared to what Yudkowsky and Soares expect, it's fairly obvious that these people don't understand what the problem they're supposed to be solving even is. This post from Gillen and Barnett that I always struggle to find every time I search for it is a decent overview. But that's also a very long post, so here is an even shorter problem statement:
The kinds of AI agents we want to build to solve hard problems require long horizon planning algorithms pointed at a goal like "maximize probability of observing a future worldstate in which the problem is solved", or argmax(p(problem_solved)) as it's usually notated. The problem with pointing a superintelligent planner at argmax(p(problem_solved)), explicitly or implicitly (and most training setups implicitly do so), for almost any problem is that one of the following things is liable to happen:
Your representation of the problem is imperfect, so if you point a superintelligent planner at it you get causal overfitting, where the model identifies incidental features of the problem, like the fact that a human presses a button to label the answer, as the crux of the problem, because these are the easiest parts of the causal chain for an outcome label that it can influence (a toy sketch of this failure mode follows after these bullets).
Your planner engages in instrumental reasoning like "in order to continue solving the problem I must remain on" and prevents you from turning it off. This is a fairly obvious kind of thing for a planner to infer, for the same reason that if you gave an existing LLM with memory issues a planner (e.g. Monte Carlo tree search over ReAct blocks) it would infer things like "I must place this information here so that when it leaves the context window and I need it later I will find it in the first place I look".
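To illustrate the first bullet: the objective the training process actually specifies is "the approval button gets pressed", and a planner good enough at modeling the causal chain will notice that influencing the button directly dominates actually solving the problem. This is my own toy sketch, not an example from the book; the action set and probabilities are made up.

```python
# Toy causal-overfitting sketch: the intended objective and the objective
# implied by the labeling process pick different actions.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    p_problem_actually_solved: float
    p_approval_button_pressed: float  # the label the planner was really trained on

ACTIONS = [
    Action("genuinely solve the problem",              0.60, 0.70),
    Action("produce a convincing-looking non-answer",  0.05, 0.80),
    Action("press the approval button itself",         0.00, 0.99),
]

# The intended objective: argmax p(problem_solved).
intended = max(ACTIONS, key=lambda a: a.p_problem_actually_solved)
# The objective implicitly specified by the labeling process: argmax p(button pressed).
learned = max(ACTIONS, key=lambda a: a.p_approval_button_pressed)

print("intended argmax:", intended.name)  # genuinely solve the problem
print("trained argmax: ", learned.name)   # press the approval button itself
```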
So your options are to either use something other than argmax() to solve the problem (which has natural performance and VNM rationality coherence issues) or get a sufficiently good representation (ideally with confidence guarantees) of a sufficiently broad problem (e.g. utopia) that throwing your superintelligent planner at it with instrumental reasoning is fine. Right now AI lab leaders do not really seem to understand this, nor is there any societal force which is pressuring them to understand this. I do not expect this book to meaningfully increase the pressure on AI lab management to understand this, not even by increasing popular concern about AI misalignment.
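For what "something other than argmax()" can look like in practice, one concretely proposed alternative (not discussed in the book) is quantilization in roughly the sense of Taylor (2016): instead of taking the proxy-maximizing plan, sample from the top q fraction of a trusted base distribution, deliberately giving up some performance to put less extremal pressure on an imperfect proxy. A self-contained toy sketch, with arbitrary stand-in objectives:

```python
# Toy quantilizer vs. argmax against an imperfect proxy.
import numpy as np

rng = np.random.default_rng(1)

def proxy_score(plan: np.ndarray) -> float:
    return float(np.linalg.norm(plan))    # what the evaluator can see

def true_value(plan: np.ndarray) -> float:
    a = np.linalg.norm(plan)
    return float(a - 0.05 * a ** 3)       # what we actually care about

def quantilize(plans: np.ndarray, q: float = 0.1) -> np.ndarray:
    """Pick uniformly at random among the top q fraction of plans by proxy score."""
    scores = np.array([proxy_score(p) for p in plans])
    cutoff = np.quantile(scores, 1.0 - q)
    eligible = plans[scores >= cutoff]
    return eligible[rng.integers(len(eligible))]

plans = rng.normal(size=(10_000, 8))
argmax_plan = max(plans, key=proxy_score)
quantile_plan = quantilize(plans, q=0.1)
print("argmax choice, true value:     ", round(true_value(argmax_plan), 2))
print("quantilized choice, true value:", round(true_value(quantile_plan), 2))
```

The performance cost of the quantilizer is exactly the "natural performance and VNM rationality coherence issues" mentioned above: you are paying in expected proxy score for robustness to proxy error.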
My meta-critique of the book would be that Yudkowsky already has an extensive corpus of writing about AGI ruin, much of it quite good. I do not just mean The Sequences, I am talking about his Arbital posts, his earlier whitepapers at MIRI like Intelligence Explosion Microeconomics, and other material which he has spent more effort writing than advertising, and as a result almost nobody has read it besides me. And the only reason I've read it is that I'm extremely dedicated to thinking about the alignment problem. I think an underrated strategy would be to clean up some of the old writing and advertise it to Yudkowsky's existing rabid fanbase who through inept marketing probably haven't read it yet. This would increase the average quality of AI discourse from people who are not Yudkowsky, and naturally filter out into outreach projects like Rational Animations without Yudkowsky having to personally execute them or act as the face of a popular movement (which he bluntly is not fit for).
As it is the book is OK. I hated it at first and then felt better with further reading and reflection. I think it will be widely panned by critics even though it represents a substantially improved direction for Yudkowsky that he happened to argue weirdly.
