Discussion and Reflection on Potential AI Risks

This post is a response to Clara Collier's review of a book on AI risk. The author begins by affirming that details matter, then raises several disagreements with Collier's review, arguing that her understanding of the "intelligence explosion" (FOOM) may be off the mark and emphasizing that FOOM is not the sole foundation of the case for AI risk. The post goes on to discuss the possibility of AI superintelligence, the difficulty of aligning AI with human values, AI's instrumental drive toward power and control, and the view that humanity may get only one or two chances to set an AI's values correctly. The author argues that even gradual AI development does not guarantee enough time to manage the risk; what matters is whether the alignment problem can be solved before AI becomes dangerous. The post also examines the relationship between the pace of AI development and the "one shot" problem, and cites If Anyone Builds It, Everyone Dies to argue that even gradual development could end in catastrophe, so a global pause on AI development may be necessary.

💡 FOOM is not the sole basis of the AI-risk argument: although the "intelligence explosion" (FOOM) has long been a topic at MIRI (the Machine Intelligence Research Institute), it is not a load-bearing plank of the case that AI is dangerous. The author argues that even if AI capabilities progress slowly and continuously, the outcome can still be catastrophic; the core argument does not depend on a rapid intelligence explosion.

🧠 The possibility of superintelligence and the alignment problem: AI superintelligence is possible in principle, and machines hold inherent advantages in speed, memory, and self-improvement. Humanity, however, currently lacks the fundamental understanding needed to instill reliable, human-aligned values in a system far more intelligent than ourselves; an AI's internal thought processes may be utterly unlike a human's, and its goals will not necessarily converge on human values.

⏳ Gradual development does not mean enough time to respond: the author pushes back on the idea that slow, continuous AI progress gives humanity ample time to learn and adapt. The risk lies in AI's capacity for rapid self-improvement and in internal mechanisms humans struggle to understand; even gradual development can slip out of control past some critical point. The key question is whether humanity can solve the extremely difficult alignment problem before AI becomes dangerous, not merely whether there is time.

🎯 The "one shot" problem and the nature of the risk: the post compares AI alignment to deflecting an asteroid headed for Earth; even with extensive preparation, there may be only one real chance to get it right. The severity of AI risk comes from its global, lethal, and irreversible character. A failure would be catastrophic, and this has little to do with whether AI development "explodes"; it stems from the extreme difficulty of the problem and from our inability to learn enough from a truly major failure.

⏸️ Weighing a global pause on AI development: given the severity of the risk and humanity's limitations on alignment, the author cites If Anyone Builds It, Everyone Dies in discussing an immediate, verifiable, worldwide halt to large-scale AI development. This would be analogous to the international cooperation that has so far prevented nuclear war, since avoiding a shared extinction is in every national leader's interest, and it may be the only workable response to AI risk.

Published on September 20, 2025 3:55 PM GMT

Clara Collier recently reviewed If Anyone Builds It, Everyone Dies in Asterisk Magazine. I’ve been a reader of Asterisk since the beginning and had high hopes for her review. And perhaps it was those high hopes that led me to find the review disappointing.

Collier says “details matter,” and I absolutely agree. As a fellow rationalist, I’ve been happy to have nerds from across the internet criticizing the book and getting into object-level fights about everything from scaling laws to neuron speeds. While they don’t capture my perspective, I thought Scott Alexander and Peter Wildeford’s reviews did a reasonable job at poking at the disagreements with the source material without losing track of the big picture.

But I did not feel like Collier’s review was getting the details or the big picture right. Maybe I’m missing something important. Part of my motive for writing this “rebuttal” is to push back on where I think Collier gets things wrong, but part of it stems from a hope that by writing down my thoughts, someone will be able to show me what I’m missing. (Maybe Collier will respond, and we can try to converge?)

I’ll get into a mess of random nitpicking at the end of this essay, but I want to start by addressing two main critiques from Collier’s review that I think are pretty important:

FOOM

Collier writes:

[The MIRI worldview says there will be] a feedback loop where AIs rapidly improve their own capabilities, yielding smarter agents, which are even better at AI research, and so on, and so forth — escalating uncontrollably until it yields a single AI agent which exceeds all humans, collectively, in all mental abilities. This process is called an intelligence explosion, or, colloquially, FOOM (rhymes with “doom”). It’s probably the single most controversial premise inside the community of people who seriously worry about superintelligent AIs

[...]

[O]ne would think they’d devote some space to explaining why they think an intelligence explosion is likely to occur. Remarkably, they don’t. The concept gets two sentences in the introduction. They don't even explain why it's relevant. It is barely introduced, let alone justified or defended. And it’s certainly not obvious enough to go without saying, because advances in the neural networks which constitute current advanced AI have been continuous. The combination of steady algorithmic progress and increasing computational resources have produced years of predictable advances. Of course, this can’t rule out the possibility of a future intelligence explosion, but the decision not to explain why they think this might happen is utterly baffling, as it’s load-bearing for everything that follows.

Collier is absolutely right that recursive self-improvement leading to a rapid intelligence explosion has been a common topic for MIRI since the era when it was the Singularity Institute.

But I would strongly disagree with the notion that FOOM is “a key plank” in the story for why AI is dangerous. Indeed, one of the most useful things that I, personally, got from the book was seeing how it is not load-bearing for the core arguments.

Yudkowsky does think an intelligence explosion stemming from recursive self-improvement is at least somewhat likely (as do I). But even if AI capabilities progress at a relatively slow rate, the likeliest outcome is still catastrophe. The scenario provided in Part 2 of the book deliberately focuses on AIs that can’t self-improve much and can’t achieve vast superintelligence (until well after humanity is disempowered) because the authors wanted to show that the core argument doesn’t require those assumptions.

Just to check, let’s go through the book’s central points, which were nicely laid out by Peter Wildeford in his review (all this text is his):

    1. AI superintelligence is possible in principle and will happen eventually. Machines possess inherent advantages in speed, memory, and self-improvement that make their eventual superiority over biological brains almost certain. Chess AI did not stop at human-level intelligence but kept going to vastly surpass human Chess players. Go AI did not stop at human-level intelligence but kept going to vastly surpass human Go players. AI will become superhuman at more and more things until eventually it becomes superhuman at everything.
    2. AI minds are alien and we currently lack the fundamental understanding to instill reliable, human-aligned values into a mind far more intelligent than our own. Current AI models are grown through complex processes like gradient descent, not crafted with understandable mechanisms. This results in alien minds with internal thought processes fundamentally different from humans. We can create these systems without truly understanding how they work, but we can’t specify and control their values.
    3. You can’t just train AIs to be nice. AIs trained for general competence will inevitably develop their own goals as a side effect and these emergent preferences will not align with human values. Instead, they will be shaped by the AI's unique cognitive architecture and training environment. An AI's level of intelligence is independent of its ultimate goals. A superintelligent AI will not inherently converge on human values like compassion or freedom. Instead, it will pursue its own arbitrary objectives with superhuman efficiency.
    4. Nearly any AI will want power and control, because it is useful to whatever goals the AI does have. To achieve any long-term goal, an AI superintelligence will recognize the instrumental value of self-preservation, resource acquisition, and eliminating potential threats. From its perspective, humanity is both a competitor for resources and a potential threat that could switch it off. Therefore, eliminating humanity becomes a logical side effect of pursuing its primary goals, not an act of malice.
    5. We only get one chance to specify the values of an AI system correctly and robustly, as failure on the first try would be catastrophic. Combined with our lack of understanding, this is akin to trying to coach a high schooler to make a computer secure against the NSA on his first try or trying to get a college graduate to build a flawless nuclear reactor on her first try.
    6. Because of 2-5 and maybe other reasons, superintelligence will inevitably lead to human extinction with near certainty, regardless of the positive intentions of the creator. It is not sufficient to create superintelligence in a nice, safety-focused, Western AI company. Per the authors, anything and anyone using current or foreseeable methods will inevitably lead to the extinction of humanity. The authors assert this is not a speculative risk but a predictable outcome with very high confidence.
    7. According to the authors, the only rational course of action in reaction to (6) is an immediate, verifiable, full-scale and global halt to all large-scale AI development. This would be potentially similar to how the world got together and managed to prevent nuclear war (so far). It would require international treaties, monitored consolidation of computing hardware, and a halt to research that could lead to AI superintelligence. On Yudkowsky’s and Soares’s worldview, other policy solutions don’t come close to solving the problem and are basically irrelevant. But a global pause would potentially be feasible because it is in the interest of any world leader — in China, Russia, or the US — “to not die along with their families.” This shared interest in survival is what prevented a global nuclear war.
    8. At minimum, if you’re not fully bought into (7), the authors argue we should build in the optionality to pause AI development later, if we get more evidence there is a threat. The authors consider this solution insufficient, but a nonetheless worthy first step as there is a lot of preparatory work to do.

</PeterWildeford>

Is FOOM load bearing for any of those points? I don’t think so. The first four points (which correspond to the first five chapters of the book) at the very least seem to me to be independent of takeoff speeds. One could argue that point 2 is irrelevant if we go slowly, because even though we currently lack skill to align AIs, we may develop that skill in the future before they’re dangerous, but the point that they’re alien and very opaque still stands.

My guess is that Collier thinks that FOOM is necessary to justify point 5, and that if there’s “steady, predictable advances” that we’ll have multiple tries. I’ll address that in a moment. Regardless, I think it’s simply not true that FOOM is load-bearing for “everything” in the book, and I think it’s therefore clumsy to attack Soares and Yudkowsky for not talking about it more.

On the topic of FOOM, I also want to quickly address what appears to be another misconception: whether a rapidly increasing level of intelligence is necessarily linked to there being a single ASI. While an extremely rapid takeoff might allow for a first-mover to seize everything (and indeed, I think this is Yudkowsky’s default guess), I see no reason why multiple AI systems couldn’t take off roughly in parallel. (My 2016 novel Crystal Society imagines this exact premise.)

Yudkowsky also thinks this is plausible, or at least worth considering. IABIED directly addresses the notion that there might be multiple AIs:

It doesn’t have to be just one AI that gets smart enough. We could have told a story where there are thousands of superhuman machine intelligences around the world, all waiting against a calculated day when a change of strategy would be militarily successful. There are many different ways the future could go. They all lead to the same place, if a story like this one is allowed to begin.

And in the book’s online resources:

It’s not obvious to us whether there will be multiple smarter-than-human AIs of comparable ability, such that an “AI civilization” might emerge that has need of “AI property rights.” [...] We’re not saying that a single, dominant AI will necessarily emerge, but rather that it seems like a hard call.

(See also: this bit of the online resources talking about multiple AIs, and this bit of the online resources that addresses the idea of using AI competition to protect us.)

Also on the topic of getting FOOM right, I think it’s probably a mistake to characterize an intelligence explosion as contrary to a picture of predictable, continuous progress. When the forecasters behind AI 2027 extrapolated out the “predictable” trajectory of AI development, they described a scenario where the AIs become superhuman and recursively self-improve, leading to an intelligence explosion that kills all humans! Collier seems to imply that continuous = slow and gradual, and I just don’t think that follows.

Collier writes:

Even proponents of AI 2027, itself a relatively extreme view, believe that there will be several competing AIs which will gradually become more capable and more power-seeking over a period of years, giving humans plenty of chances to catch them in the act and take precautionary measures.

I think this is the wrong take on AI 2027. The “competing AIs” they describe are, as far as I can tell, only really relevant in that scenario because the race dynamics make things harder. Yes, AI 2027 involves many opportunities for the humans to notice that they’re in danger prior to the full loss of control, but none of that contradicts the story of an uncontrollable intelligence explosion. The intelligence explosion becomes uncontrollable when the humans fail to stop advancing AI capabilities before the critical window closes. (See my thoughts on AI 2027 for more on why the Slowdown ending is unrealistic.)

The scenario in IABIED also involves an AI that becomes uncontrollable after some amount of time where the developers have the opportunity to notice the warning signs and take precautionary measures. And there’s some discussion of “catching the AI in the act”:

Or maybe [the company building the first potentially dangerous AI] would alert government officials, and there would be a big argument in a presidential office about whether to order [the AI] to create a smarter AI before other countries did it first. Maybe the side of caution would win, and time would pass. . . ​until a copy of [the AI] stolen by foreign intelligence services was run by more reckless operators.

I don’t think anyone believes that we have no opportunity to notice warning signs. In an important sense, we’re already in that position today. Developers are noticing a steady trickle of red flags, and the issue is more that these are not blatant enough to serve as a “fire alarm” or get most of the people building those machines to change course. Perhaps Collier believes there will be some significant jump in how scary these things are, such that the warning signs become much more potent? I’m curious for other people’s perspectives here (especially if you’ve read the book), because I feel confused about the supposed contrast with AI 2027.

Relevant sections of the book’s online supplement:

Gradualism

This seems like a good point to transition into talking about the other central point that I hear Collier as making, which I might paraphrase as “gradual progress means that we’ll have time to learn about how AI works and develop techniques for aligning and controlling it that can be tested and refined in pace with the AI getting stronger and smarter.” This belief that we might be okay if AI progress is gradual — if it’s slow enough — that we’ll be able to adapt, is pretty common, I think, and it has merit. I agree, for starters, that going slow is broadly helpful. That’s why I think it’s important to set up policies and regulations that slow things down.

But speed is not the only crucial aspect of the gradualism hope — it’s also essential that the work we do to align AIs on Monday will continue to be relevant on Tuesday. Collier seems to think that Yudkowsky’s position is that current work is useless, because there will be a radical discontinuity. Here’s her attempt to characterize Yudkowsky’s position:

Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.

I think this is not Yudkowsky’s position. I’m not sure exactly what the real Yudkowsky would say, but my shoulder-Eliezer says “It’s not that the work is irrelevant, it’s that it’s insufficient. If you tackle a problem that’s too hard, you can spend a lot of time without getting anywhere close to the solution. Even if it turns out we have nice, smooth, gradual progress over the span of a decade or whatever, we’re still going to die when we build a superintelligence because our civilization is too inadequate to address everything on the list of lethalities. This is a very hard problem, and you won’t have the luxury of learning from (a large and critical subset of) your failures, unlike all the other very hard engineering problems we’ve solved before.”

Perhaps I think he’d say this because I, personally, have a lot of esteem for the people who are currently working on solving the technical problems (including myself!) and think that we need more people working to try and figure out alignment. I agree that MIRI is, at times, dismissive of people’s hope in a way that can be abrasive and make it seem like the authors don’t have esteem for our colleagues. But I also note that the authors of IABIED have repeatedly praised interpretability researchers, made personal efforts to make progress on the technical challenges, and so on.

Let’s turn to the question of whether we’ll only have one shot, since it seems like it might be something of a double-crux.

Collier writes:

We are told: "Humanity only gets one shot at the real test." That is, we will have one opportunity to align our superintelligence. That's why we'll fail. It's almost impossible to succeed at a difficult technical challenge when we have no opportunity to learn from our mistakes. But this rests on another implicit claim: Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.

I think the situation with ASI is similar to a killer asteroid headed for Earth that we only have one real shot to deflect, because we only detected it recently. In the case of the asteroid, at least, there’s no relationship between whether we only have one shot, and whether the work we’ve done before this point is relevant. Yes, we already have some spacecraft, and we’ve learned a lot that’s relevant. But we still only get one shot. If we mess up for whatever reason, everyone dies, because the problem is global, lethal, and irreversible, not because it’ll be different from what we’ve dealt with before.

Now, notably, it might also be qualitatively different from what we’ve dealt with before. For example, AIs that are not clearly capable of taking over the world might avoid thoughts or behaviors that push in that direction for the simple reason that those thoughts and behaviors aren’t useful for getting them what they want. The actually-dangerous ASI might be the first intelligent model that invests heavily in scheming. Or it might not! Seems like a hard call whether the first dangerous ASI will be very similar to current models, and I’ve heard people go both ways. This uncertainty makes me more worried, but it is not load-bearing with regard to whether AI is dangerous or whether we only have one shot.

Collier:

If one believes that AI progress will be slow and continuous, or even relatively fast and continuous, it follows that we’ll have more than one shot at the goal.

I’m pretty confused by this, and would love to hear a followup clarification of the logic she has in mind.

My best attempt to steelman the argument is: “If AIs develop in a smooth fashion, we’ll have some time to engage with and learn from minds that are comparably intelligent to a human before we face something radically superhuman. Like grinding low-level enemies in an RPG, this experience will give us the opportunity to level-up our alignment skills so that when we face the boss challenge of a radically superhuman AI, we’ll be ready. We’ll probably fail a bunch on the weaker AI, but those failures will let us broadly succeed on aligning all the AIs where failure would be catastrophic.”

(Apologies if the steelman is missing important things. I want comments/feedback here.)

I think this steelman has some truth. I do think we benefit from having a long, slow period of adaptation and exposure to not-yet-extremely-dangerous AI. As long as we aren’t lulled into a false sense of security, it seems very plausible that insights from studying these systems will help improve our skill at alignment. I think ideally this would mean going extremely slowly and carefully, but various readers may be less cautious/paranoid/afraid than me, and think that it’s worth some risk of killing every child on Earth (and everyone else) to get progress faster or to avoid the costs of getting everyone to go slow. But regardless of how fast things proceed, I think it’s clearly good to study what we have access to (as long as that studying doesn’t also make things faster or make people falsely confident).

But none of this involves having “more than one shot at the goal” and it definitely doesn’t imply the goal will be easy to hit. It means we’ll have some opportunity to learn from failures on related goals that are likely easier.

The “It” in “If Anyone Builds It” is a misaligned superintelligence capable of taking over the world. If you miss the goal and accidentally build “it” instead of an aligned superintelligence, it will take over the world. If you build a weaker AGI that tries to take over the world and fails, that might give you some useful information, but it does not mean that you now have real experience working with AIs that are strong enough to take over the world.

Nitpicks

I’m going to go through a few of the random bits of the review that I take issue with. My guess is that these are less important, even on the whole, but in the spirit of getting the details right I want to at least offer some pushback.

Collier:

In the book, Yudkowsky and Soares argue that it should be illegal to own more than eight of the most powerful GPUs available in 2024 without international monitoring (frontier training runs today use tens of thousands). To more mainstream AI safety organizations, this position entails rejecting valuable work in favor of policies that can’t be implemented and would tank the global economy if they were. 

I find it very strange that Collier claims that international compute monitoring would “tank the global economy.” What is the mechanism for this, exactly?

For context, “eight of the most powerful GPUs” might not intuitively sound like much, but that’s over $200,000 in hardware. Quoting an online appendix to the book:

The most advanced AI chips are also quite specialized, so tracking and monitoring them would have few spillover effects. NVIDIA’s H100 chip, one of the most common AI chips as of mid-2025, costs around $30,000 per chip and is designed to be run in a datacenter due to its cooling and power requirements. These chips are optimized for doing the numerical operations involved in training and running AIs, and they’re typically tens to thousands of times more performant at AI workloads than standard computers (consumer CPUs).

And:

Most people are not purchasing dozens of top-of-the-line AI GPUs and housing them in their garages. Most people aren’t running huge datacenters. Most people won’t even feel the effects of a ban on AI research and development. It’s just that ChatGPT wouldn’t change quite so often.

Humanity wouldn’t even need to stop using all the current AI tools. ChatGPT wouldn’t have to go away; we could keep figuring out how to integrate it into our lives and our economy. That would still be more change than the world used to see for generations. We would miss out on new AI developments (of the sort that would land as AI gets smarter but not yet smart enough to kill everyone), but society is mostly not clamoring for those developments.

And we would get to live. We would get to see our children live.

Developments that people are clamoring for, such as the development of new and life-saving medical technology, seem possible to pursue without also pursuing superintelligence. We are in favor of carve-outs for medical AI, so long as they function with adequate oversight and steer clear of dangerous generality.
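As a quick back-of-envelope check on the “over $200,000” figure above, here is a minimal sketch in Python. It assumes the roughly $30,000-per-chip H100 price from the quoted appendix and the eight-GPU threshold from the proposal; the numbers are illustrative, not the authors’ exact accounting.

    # Rough cost of hardware at the proposed eight-GPU monitoring threshold.
    # Assumption: ~$30,000 per H100-class chip, per the appendix quoted above.
    PRICE_PER_GPU_USD = 30_000
    THRESHOLD_GPUS = 8

    total_usd = PRICE_PER_GPU_USD * THRESHOLD_GPUS
    print(f"~${total_usd:,} of hardware at the threshold")  # ~$240,000 of hardware at the threshold

In other words, the monitoring threshold sits at roughly a quarter of a million dollars of specialized datacenter hardware, not at anything resembling a consumer setup.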

There’s a strengthened version of Collier’s argument that says

    1. The next generation of AI will produce huge economic gains (and won’t kill us all)
    2. Not building the next generation will deprive us of those gains
    3. The counterfactual impact of stopping AI capabilities research is that we’ll be a lot poorer

I think this is valid (but not knowably sound, because it might kill us), and if we want to recklessly charge forward, I agree that we’ll be counterfactually richer (until we’re dead or disempowered). But if this is what Collier had in mind, I think it’s extremely bad form to characterize slowing down as “tanking the global economy,” since it would still allow for normal economic progress and would not involve people getting poorer.[1]

I understand the position that thinks there’s no such thing as ASI, or it would be safe by default, such that there’s no reason for governments to intervene here. And I understand the maximally libertarian position that argues that the government shouldn’t be able to make things illegal, regardless of cost. But I don’t understand the position Collier is putting forward here.

My position, for transparency: I’m broadly libertarian/classically liberal, but think there are exceptions. We already need to police the world to make sure random people aren’t building nuclear weapons, and I expect this will soon need to extend to labs capable of engineering bioweapons. I do not want a “global surveillance state,” but I think it’s possible for the government to track chips while also leaving almost everyone’s privacy almost entirely intact.

More on this in the book’s online supplement:

Collier:

Yudkowsky and Soares’ draconian chip monitoring proposals aren’t neatly separable from their arguments about the nature of artificial intelligence. They are the conclusion of a very specific set of beliefs — for example, that danger will come in the form of a single superintelligent AI, which can’t be monitored or countered by other systems, and that its ability to improve its own code will make the amount of physical compute it has access to largely irrelevant. These points are implied by the MIRI worldview, but Yudkowsky and Soares largely breeze past them. The book spends whole chapters unpacking the motivations of future superintelligent AIs, but devotes very little space to justifying its own account of how those AIs will be created.

It is true that the authors have a worldview and don’t spend much time defending the parts of it that aren’t central to the thesis. This is what makes the book so short and straightforward compared to other things Yudkowsky has written.

Collier:

We could say — and certainly Yudkowsky and Soares would say — that this isn’t important, because the essential dynamics of superintelligence don’t depend on any particular architecture.

Yes! Exactly!

But that just raises a different question: why does the rest of the book talk about particular architectures so much?

A few reasons, like:

See also, in the supplemental materials:

Collier:

[Yudkowsky says] we should expect AIs to have strange alien values that we can't control or predict, since the humans who “grow” AIs can’t exactly input ethics or morals by hand. This might seem broadly reasonable — except that this was also Yudkowsky’s conclusion in 2006, when he assumed that AIs would be crafted. Back then, his argument was that during takeoff, when an AI rapidly self-improves into superintelligence, it would undergo a sudden and extreme value shift.

To my memory this was never the argument. I think that the argument was "we don't know how to encode the right values by hand nor do we know how to encode a meta-process that gets the right values at the end." Yes, there was some work on tiling agents (including the Löbian stuff) and some discussion of what to do about value drift, but I do not recall any time when Yudkowsky was optimistic about hand-coding it to have the right values, nor a time when his only concern was getting values to stick when the AI self-improves.

Some aspects of the alignment problem look different today than they did 15 years ago. For example, Eliezer mentioned in a conversation a few days ago:

The book does not go very hard on the old Fragility of Value thesis from the Overcoming Bias days, because the current technology is bad enough that we're not likely to get that kind of close miss.  The problem is more like, "you get some terms of the utility function sorta right on the training distribution but their max outside the training distribution is way different from where you hoped it would generalize" than "the AI cares about love, life, happiness, fun, consciousness, novelty, and honor, but not music and freedom".

But neither of these pictures is saying that the core problem (much less the whole problem) is goal instability. And they’re certainly not saying the whole problem is “goal instability specifically during recursive self-improvement that specifically is crossing the gap to superintelligence.”

Collier:

So far, when these AIs have acted against the interests of humans, the motives haven’t exactly been alien. [...] “Sydney” [...] famously flew into a jealous rage after failing to break up New York Times journalist Kevin Roose’s marriage. But even if jealous ex-girlfriends were alien and incomprehensible, there’s the inconvenient fact that currently available techniques do a reasonably good job of addressing this problem. ChatGPT currently has 700 million weekly active users, and overtly hostile behavior like Sydney’s is vanishingly rare.

Describing Sydney’s behavior as human is… certainly not how I would describe it.

But also, does Collier really think she knows Sydney’s motives? Does she think that Sydney was actually jealous, in the same way a human is, and that her words were drawn out of a desire for Kevin Roose’s love? I’m not sure it makes sense to describe Sydney as an agent at all, much less one with human motives.

Regardless, I would also not describe our current techniques as doing a reasonably good job unless “good job” involves the occasional Mecha Hitler, AI-driven psychosis/weirdness, and quietly commenting out unit tests. Yes, none of these are overtly hostile in the way that Sydney was, but overt hostility is hardly the only sign of things gone wrong.

Again, linking to some deeper discussion of this:

And more to the point, Collier writes:

Yudkowsky and Soares might respond that we shouldn’t expect the techniques that worked on a relatively tiny model from 2023 to scale to more capable, autonomous future systems. I’d actually agree with them. But it is at the very least rhetorically unconvincing to base an argument for future danger on properties of present systems without ever mentioning the well-known fact that present solutions exist.

It is not a “well-known fact” that we have solved alignment for present LLMs. If Collier believes otherwise, I am happy to make a bet and survey some alignment researchers. (Or cite the work coming out of Anthropic, Redwood Research, Apollo, or any of the other great places doing empirical work.)

Collier:

Yudkowsky and Soares don’t believe that studying [current] systems can teach us anything at all.

Why does Collier think this? Am I missing something? The book is chock-full of examinations and examples from current systems. The authors call interpretability researchers “heroes” for a reason. One of the main points of the book is that the mess that we see is a demonstration of how far away a mature understanding of alignment is. That’s an important takeaway!

Collier:

They’re also wrong about alchemists. Real early modern alchemy texts are painfully specific and process oriented. There is no significant tonal difference between Isaac Newton, physicist, and Isaac Newton, alchemist: less grand theorizing, more lab notes. Sometimes these works sound florid to modern ears, but that's also true of early modern century scientific writing and for that matter Eliezer Yudkowsky.

I don’t think Yudkowsky and Soares are picking on alchemists’ tone, I think they’re picking on the combination of knowledge of specific processes and ignorance of general principles that led to hubris in many cases.

Collier:

[The intelligence explosion] plays a major role in the fictionalized description of a possible AI takeover scenario that makes up chapters seven through nine (and which has some elements that sound plausible, even if it does also include the Yudkowsky classic "tiny molecular machines with the strength of diamond and corresponding mechanical advantages in their speed and resilience").

I’m pretty confused by this description of the Part 2 scenario.

First of all, there is no nanotech in the takeover scenario prior to the AI having full control of the planet, unless you count the transistors in the computer or the drugs and biological viruses that the AI develops. It plays no role in the AI’s takeover whatsoever; from a human perspective, it doesn’t matter for the story. (Though the feasibility of nanotech might matter for other scenarios.)

More importantly, the scenario also doesn’t feature anything like an intelligence explosion until after the AI has full control of the planet. As I noted earlier, this was deliberately done to show that those kinds of developments, plausible though they are, are not load-bearing.

The scenario does start with the discovery of a new scaling law, similar to breakthroughs like the transformer architecture and chain-of-thought reasoning. But as described in the book and online resources, this isn’t load-bearing either.

More Was Possible

I don’t think IABIED is a perfect book. I don’t think Yudkowsky is a perfect communicator or figurehead or whatever. I want a thousand books about the risks, with a variety of different tones and styles and perspectives. I want our friends and allies and fellow rationalists to point out all the issues and help us do even better.

But I also want us to not let the perfect be the enemy of the good. There is space to take issue with Yudkowsky’s level of confidence[2] and use this opportunity to call out for the world to notice that the risks of our current path are unacceptably high. I was hoping that Asterisk’s review would get the details right and help our cause, but instead I feel sad and confused.[3] I welcome comments and responses to help me understand where I’m wrong.

  1. ^

    Yes, it would cause Nvidia's stock price to drop, and presumably the stock price of other tech companies. Even if those stocks tank, I do not think it's fair to describe this as tanking the global economy. Those high stock prices reflect the market's unrealized expectation, and I think my point about counterfactual losses not being the same as an actual increase in poverty stands.

  2. ^

    I, and some others at MIRI, are more optimistic than Eliezer. There’s lots of room for allies to disagree.

  3. ^

    And maybe a little angry around the edges. Sorry if any of that crept in. I’m genuinely trying hard not to be antagonistic, and want us to work together instead of getting caught up in infighting. 😔


