Examining and Reflecting on "If Anyone Builds It, Everyone Dies"

 

This article takes a close look at Eliezer Yudkowsky and Nate Soares's book If Anyone Builds It, Everyone Dies. It argues that, despite the book's contested claims, its treatment of the AI alignment problem, especially now that the earlier, more inflammatory rhetoric has been dropped, represents an important advance in Yudkowsky's thinking. The reviewer maintains that the value loading problem has largely been solved, but notes that this is not a final solution to AI alignment, because human values generalize poorly out of distribution and a superintelligent model of human values is still needed. The article also criticizes the book's ill-suited narrative structure and its dismissiveness toward existing AI technology, while agreeing with its charge that AI lab leaders show an inadequate grasp of the alignment problem.

📚 **Intellectual evolution and a shift in rhetoric**: If Anyone Builds It, Everyone Dies marks an important turn in Yudkowsky's writing career. The book largely drops the extreme, inflammatory framing that contributed to the "AI alignment winter", no longer leaning on concepts like "paperclips" and "deceptive mesaoptimizers" and instead describing how an AI system may, as it develops, end up preferring its own "favorite things". This brings the argument closer to reality, more in line with Scott Alexander's AI 2027 scenario, and makes it more constructive.

💡 **Progress and limits of value loading**: The article agrees that the "value loading problem" (getting an AI to internalize and follow human values before it becomes superintelligent) has largely been solved. The author stresses, however, that this is only one step toward AI alignment, not the endpoint. Even if an AI internalizes human values, a merely human-level model of those values cannot reliably evaluate the plans of a superintelligent planner or apply the values correctly in domains beyond human experience, so a "superintelligent model of human values" is needed, and research toward one still faces major challenges.

🤔 **Disputes over narrative structure and technical perspective**: The reviewer criticizes the book's storytelling, particularly its parable openings and its preference for a behaviorist definition of "want", arguing that these undercut the sense of urgency. The book's dismissal of current large language models as "shallow", blank-slate technology is likewise seen as an unhelpful abstraction that fails to ground the discussion in existing AI systems, leaving parts of the argument feeling detached from reality, more like science fiction.

⚠️ **Consensus on AI leaders' blind spots**: Despite disagreeing with the book's overall thesis, the reviewer shares its view that many current AI lab leaders are embarrassingly ignorant about AI alignment. Whether it is OpenAI's Sam Altman, xAI's Elon Musk, or DeepMind's Shane Legg and Demis Hassabis, their public statements reveal a weak grasp of core alignment challenges such as deceptive alignment, and a lack of both the ability and the urgency needed to address AI's potential risks.

Published on September 19, 2025 1:23 AM GMT

"If Anyone Builds It, Everyone Dies" by Eliezer Yudkowsky and Nate Soares (hereafterreferred to as "Everyone Builds It" or "IABIED" because I resent Nate's gambitto get me to repeat the title thesis) is an interesting book. One reason it'sinteresting is timing: It's fairly obvious at this point that we're in an alignmentwinter. The winter seems roughly caused by:

    The 2nd election of Donald Trump removing Anthropic's lobby from the White House. Notably this is not a coincidence but a direct result of efforts from political rivals to unseat that lobby. When the vice president of the United States is crashing AI safety summits to say that "I'm not here this morning to talk about AI safety, which was the title of the conference a couple of years ago. I'm here to talk about AI opportunity" and that "we'll make every effort to encourage pro-growth AI policies" it's pretty obvious that technical work on "safety" and "alignment" is going to be deprioritized by the most powerful western institutions and people change their research directions as a result.

    Key figures in AI alignment from the MIRI cluster (especially Yudkowsky) overinvesting in concepts like deceptive mesaoptimizers and recipes for ruin to create almost unfalsifiable, obscurantist shoggoth of the gaps arguments against neural gradient methods. At the same time the convergent representation hypothesis has continued to gain evidence and academic ground. These thinkers gambled everything on a vast space of minds that doesn't actually exist in practice and lost.

    The value loading problem outlined in Bostrom 2014 of getting a general AI system to internalize and act on "human values" before it is superintelligent and therefore incorrigible has basically been solved. This achievement also basically always goes unrecognized because people would rather hem and haw about jailbreaks and LLM jank than recognize that we now have a reasonable strategy for getting a good representation of the previously ineffable human value judgment into a machine and having the machine take actions or render judgments according to that representation. At the same time people generally subconsciously internalize things well before they're capable of articulating them, and lots of people have subconsciously internalized that alignment is mostly solved and turned their attention elsewhere.

I think this last bullet point is particularly unfortunate because solving the Bostrom 2014 value loading problem, that is to say getting something functionally equivalent to a human perspective inside the machine and using it to constrain a superintelligent planner, is not a solution to AI alignment. It is not a solution for the simple reason that a general reward model needs to be competent enough in the domains it's evaluating to know if a plan is good or merely looks good, if an outcome is good or merely looks good, etc. Nearly by definition a merely human perspective is not competent to evaluate the plans or outcomes of plans from a superintelligent planner that will otherwise walk straight into extremal Goodhart outcomes. Therefore you need not just a human value model but a superintelligent human value model, which must necessarily be trained by some kind of self improving synthetic data or RL loop starting from the human model, which requires us to have a coherent method for generalizing human values out of distribution. This is challenging because humans do not natively generalize their values out of distribution, so we don't necessarily know how to do this or even if it's possible to do. The problem is compounded by the fact that if your system drifts away from physics the logical structure of the universe will push it back, but if your system drifts away from human values it stays broken.
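
To make the "is good versus merely looks good" gap concrete, here is a toy sketch of extremal Goodhart (mine, not the book's or the authors'): a proxy reward model whose score equals the true value of a plan plus heavy-tailed evaluation error. Mild search over plans still buys real value; ruthless search mostly buys error.

```python
import numpy as np

rng = np.random.default_rng(0)

def search(n_candidates):
    # True value of each candidate plan (light-tailed, unknown to the planner).
    true = rng.normal(size=n_candidates)
    # Score from an imperfect, merely human-level reward model:
    # the true value plus heavy-tailed evaluation error.
    proxy = true + rng.standard_cauchy(size=n_candidates)
    # The planner picks whichever plan *looks* best under the proxy;
    # return its true value alongside the best truly-available value.
    return true[np.argmax(proxy)], true.max()

for n in (10, 1_000, 100_000):
    chosen, best = map(np.mean, zip(*[search(n) for _ in range(500)]))
    print(f"{n:>6} plans: looks-best plan is worth {chosen:+.2f}, "
          f"truly-best plan is worth {best:+.2f}")
```

The harder the search, the more the winning plan's score is explained by reward-model error rather than anything actually valued, which is the sense in which a merely human value model fails against a superintelligent planner.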

Everyone Builds It is not a good book for its stated purposes, but it grew on me by the end. I was expecting to start with the bad and then write about the remainder that is good, but instead I'll point out this book is actually a substantial advance for Yudkowsky in that it drops almost all of the rhetoric in bullet two which contributed to the alignment winter. This is praiseworthy and I'd like to articulate some of the specific editorial decisions which contribute to this.

    The word "paperclip" does not appear anywhere in the book. Instead Yudkowskyand Soares point out that the opaque implementation details of the neural net meanthat in the limit it generalizes to having a set of "favorite things" it wants tofill the world with which are probably not "baseline humans as they exist now".This is a major improvement over the obstinate insistence that these models willwant a "meaningless squiggle" by default and brings the rhetoric more in linewith e.g. Scott Alexander's AI 2027 scenario.

    The word "mesaoptimizer" does not appear anywhere in the book. Insteadit focuses on the point that building a superintelligent AI agent means creatingsomething undergoing various levels of self modification (even just RL weight updates)and predicting the preferences of the thing you get at the end of that processis hard, possibly even impossible in principle. Implicitly the book argues that"caring about humans" is a narrow target and hitting it as opposed to other targetslike "thing that makes positive fun yappy conversation" is hard. That is assumingyou get something like what you train for and doesn't take into account what thebook calls complications. For example it cites the SolidGoldMagikarp incidentas an example of a complication which could completely distort the utility function(another phrase which does not appear in the book) of your superintelligent AIagent. There's precedent for this also in the case of the spiritual bliss attractorstatedescribed in the Claude 4 system card, where instances of Claude talking to eachother wind up in a low entropy sort of mutual Buddhist prayer.

    In general the book moves somewhat away from abstraction and comments more on the empirical strangeness of AI. This gives it a slight Janusian flavor in places, with emphasis on phenomena like glitch tokens, Truth Terminal, and mentions of "AI cults" that I assume are based on some interpolation of things like Janus's Discord server and ChatGPT spiralism cases. If anything the problem is that it doesn't do enough of this; notably absent is reference to work from organizations like METR (if I was writing the book their AI agent task length study would be a necessary inclusion). Though I should note that there's a lag in publishing and it's possible (but unlikely) that Yudkowsky and Soares simply didn't feel there was any relevant research to cite while doing the bulk of the writing. Specific named critics are never mentioned or responded to; the text exists in a kind of solipsistic void that contributes to the feeling of green ink or GPT base model output in places. A feeling that notably exists even when it's saying true things. In general most of my problem with the book is not disagreements with particular statements but with its overall thesis.

This is good and brings Yudkowsky & Soares much closer to my threat model.

All of this is undermined by truly appalling editorial choices. Foremost of these is the choice to start each chapter with a fictional parable, leading to chapters with opening sentences like "The picture we have painted is not real." The parables are weird and often condescending, and the prose isn't much better. I found the first three chapters especially egregious, with the peak being chapter three, where the book devotes an entire chapter to advocating for a behaviorist definition of want. This is not how you structure an argument about something you think is urgent, and the book comes off as having a sort of aloof tone that is discordant with its message. This is heightened if you listen to the audiobook version, which has a narrator who is not the Kurzgesagt narrator but I think is meant to sound like him, since Kurzgesagt has done some Effective Altruism videos that people liked. The gentle faux-intellectual British narrator reinforces the sense of passive observation in a book that is ostensibly supposed to be about urgent action. Bluntly: A real urgent threat that demands attention does not begin with "once upon a time". This is technically just a 'style' issue, but the entire point of writing a popular book like this is the style, so it's a valid target of criticism and I concur with Shakeel Hashim and Stephen Marche at The New York Times that it's very bad.

One oddity that stands out is Yudkowsky and Soares' ongoing contempt for large language models and hypothetical agents based on them. Again, for a book which is explicitly premised on the idea that urgent action is necessary because AI might become superintelligent in just a few years, it is bizarre that the authors don't feel comfortable making more reference to the particulars of the existing AI systems which hypothetical near-future agents would be based on. I get the impression that this is meant to help future proof the book, but it gives the sentences a kind of weird abstraction in places where they don't need it. We're still talking about "the AI" or "AI" as a kind of tabula-rasa technology. Yudkowsky and Soares state explicitly in the introduction that current LLM systems "still feel shallow" to them. Combined with the parable introductions the book feels like fiction even when it's discussing very real things.

I am in the strange position of disagreeing with the thesis but agreeing with most individual statements in the book. Explaining my disagreement would take a lot of words that would take a long time to write and that most of you don't want to read in a book review. So instead I'll focus on a point made in the book which I emphatically agree with: That current AI lab leadership statements on AI alignment are embarrassing and show that they have no idea what they are doing. In addition to the embarrassing statements they catalog from OpenAI's Sam Altman, xAI's Elon Musk, and Facebook's Yann LeCun I would add DeepMind's Shane Legg and Demis Hassabis being unable to answer straightforward questions about deceptive alignment on a podcast. Even if alignment is relatively easy compared to what Yudkowsky and Soares expect it's fairly obvious that these people don't understand what the problem they're supposed to be solving even is. This post from Gillen and Barnett that I always struggle to find every time I search for it is a decent overview. But that's also a very long post so here is an even shorter problem statement:

The kinds of AI agents we want to build to solve hard problems require long horizon planning algorithms pointed at a goal like "maximize probability of observing a future worldstate in which the problem is solved". Or argmax(p(problem_solved)) as it's usually notated. The problem with pointing a superintelligent planner at argmax(p(problem_solved)) explicitly or implicitly (and most training setups implicitly do so) for almost any problem is that one of the following things is liable to happen:

    Your representation of the problem is imperfect, so if you point a superintelligent planner at it you get causal overfitting where the model identifies incidental features of the problem, like that a human presses a button to label the answer, as the crux of the problem because these are the easiest parts of the causal chain for an outcome label that it can influence.

    Your planner engages in instrumental reasoning like "in order to continue solving the problem I must remain on" and prevents you from turning it off. This is a fairly obvious kind of thing for a planner to infer, for the same reason that if you gave an existing LLM with memory issues a planner (e.g. monte carlo tree search over ReAct blocks) it would infer things like "I must place this information here so when it leaves the context window and I need it later I will find it in the first place I look".

So your options are to either use something other than argmax() to solve the problem (which has natural performance and VNM rationality coherence issues) or get a sufficiently good representation (ideally with confidence guarantees) of a sufficiently broad problem (e.g. utopia) that throwing your superintelligent planner at it with instrumental reasoning is fine. Right now AI lab leaders do not really seem to understand this, nor is there any societal force which is pressuring them to understand this. I do not expect this book to meaningfully increase the pressure on AI lab management to understand this, not even by increasing popular concern about AI misalignment.
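
For a concrete feel for the first option, here is a continuation of the earlier toy (again mine, not the authors' proposal): instead of taking the argmax of the proxy score, sample a plan from the top decile by proxy score, a crude quantilizer. Under heavy-tailed proxy error this milder optimizer recovers more true value than the full argmax, at the cost of leaving proxy score on the table and of the coherence issues the text mentions.

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_values(n=100_000):
    true = rng.normal(size=n)                   # actual value of each plan
    proxy = true + rng.standard_cauchy(size=n)  # imperfect reward-model score
    return true, proxy

def argmax_planner(true, proxy):
    # Full optimization: take the plan that scores highest under the proxy.
    return true[np.argmax(proxy)]

def quantilizer(true, proxy, q=0.1):
    # Milder optimization: sample uniformly from the top q fraction by proxy score.
    k = max(1, int(q * len(proxy)))
    top = np.argsort(proxy)[-k:]
    return true[rng.choice(top)]

results = {"argmax": [], "quantilize": []}
for _ in range(500):
    true, proxy = plan_values()
    results["argmax"].append(argmax_planner(true, proxy))
    results["quantilize"].append(quantilizer(true, proxy))

for name, vals in results.items():
    print(f"{name:>10}: mean true value of chosen plan = {np.mean(vals):+.2f}")
```

This is only a cartoon of the tradeoff; it says nothing about how to get the "sufficiently good representation of a sufficiently broad problem" that the second option requires.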

My meta-critique of the book would be that Yudkowsky already has an extensive corpus of writing about AGI ruin, much of it quite good. I do not just mean The Sequences, I am talking about his Arbital posts, his earlier whitepapers at MIRI like Intelligence Explosion Microeconomics, and other material which he has spent more effort writing than advertising and as a result almost nobody has read it besides me. And the only reason I've read it is that I'm extremely dedicated to thinking about the alignment problem. I think an underrated strategy would be to clean up some of the old writing and advertise it to Yudkowsky's existing rabid fanbase who through inept marketing probably haven't read it yet. This would increase the average quality of AI discourse from people who are not Yudkowsky, and naturally filter out into outreach projects like Rational Animations without Yudkowsky having to personally execute them or act as the face of a popular movement (which he bluntly is not fit for).

As it is the book is OK. I hated it at first and then felt better with further reading and reflection. I think it will be widely panned by critics even though it represents a substantially improved direction for Yudkowsky that he happened to argue weirdly.



