A Discussion of Continuity in AI Development and the Alignment Problem

 

Recently, the debate over whether AI development will be gradual or involve a "cliff-like" leap has attracted attention. Many researchers hold that even if AI capabilities improve incrementally, the path from weaker AI to AGI and on to superintelligence (ASI) may involve a critical "phase change" that renders existing alignment experience useless. The article quotes Buck Shlegeris, Will MacAskill, Scott Alexander, and Clara Collier, and examines the plausibility of gradual capability gains, the importance of alignment at different stages of development, and the continuity between current research and future superintelligence. The core question is whether lessons learned from current AI research will be enough to meet the challenge of aligning future superintelligence, and how predictions about the development trajectory shape alignment strategy.

🧐 **The "phase change" argument in AI development**: Although AI capabilities are widely expected to improve step by step, some argue that the path from weaker AI to AGI and on to ASI may involve a critical "discontinuity" or "phase change," so that lessons learned while AI is too weak to take over cannot be applied directly to more advanced systems. Such a change could arise because an AI becomes able to improve itself, or because its intelligence crosses a qualitative threshold.

🚀 **Gradual development and the value of alignment learning**: Some researchers (such as Will MacAskill) argue that AI capability gains are more likely to be gradual than a sudden overnight "awakening." In that case, humans would have the chance, as capabilities grow, to harness existing AI labour to help align the next generation of models and to gain better insight into how future superintelligent systems will behave. Early and continuous experimentation could thus build up valuable experience for the superintelligence alignment challenge, rather than facing the pressure of having to succeed in a single attempt.

🤔 **The disputed continuity between current research and future superintelligence**: The value of today's AI research for aligning future superintelligence is contested. Some hold that existing AI differs so fundamentally from future superintelligence that current work matters little; others believe development is continuous enough that present research and experimentation can provide useful insights and methods, laying the groundwork for future challenges. The disagreement is rooted in specific empirical beliefs about how AI capabilities will develop and what it will take to control them.

🔑 **The "first try or die" alignment dilemma**: The article examines the "first try or die" framing of alignment. If humanity must align a superintelligence successfully on its first and only chance, with no opportunity to learn from mistakes, the probability of success is very low. Whether this framing applies depends on underlying beliefs about the capability trajectory: if AI development is slow and continuous, there will be more opportunities to learn and adjust.

Published on September 23, 2025 7:30 PM GMT

A number of reviewers have noticed the same problem with IABIED: an assumption that lessons learnt from AGI cannot be applied to ASI -- that there is a "discontinuity" or "phase change" -- even under the assumption of gradualism. The only explanation so far is Eliezer's "Dragon story" ... but I find it makes the same assumptions, and Buck seems to find it unsatisfactory, too. Quotes below.


Buck Shirgellis: ""I’m not trying to talk about what will happen in the future, I’m trying to talk about what would happen if everything happened gradually, like in your dragon story!

You argued that we’d have huge problems even if things progress arbitrarily gradually, because there’s a crucial phase change between the problems that occur when the AIs can’t take over and the problems that occur when they can. To assess that, we need to talk about what would happen if things did progress gradually. So it’s relevant whether wacky phenomena would’ve been observed on weaker models if we’d looked harder; IIUC your thesis is that there are crucial phenomena that wouldn’t have been observed on weaker models.

In general, my interlocutors here seem to constantly vacillate between “X is true” and “Even if AI capabilities increased gradually, X would be true”. I have mostly been trying to talk about the latter in all the comments under the dragon metaphor."

Will McAskill: "Sudden, sharp, large leaps in intelligence now look unlikely. Things might go very fast: we might well go from AI that can automate AI R&D to true superintelligence in months or years (see Davidson and Houlden, “How quick and big would a software intelligence explosion be?"). But this is still much slowerthan, for example, the “days or seconds” that EY entertained in “Intelligence Explosion Microeconomics”. And I don’t see any good arguments for expecting highly discontinuous progress, rather than models getting progressively and iteratively better.

In Part I of IABIED, it feels like one moment we’re talking about current models, the next we’re talking about strong superintelligence. We skip over what I see as the crucial period, where we move from the human-ish range to strong superintelligence[1]. This is crucial because it’s both the period where we can harness potentially vast quantities of AI labour to help us with the alignment of the next generation of models, and because it’s the point at which we’ll get a much better insight into what the first superintelligent systems will be like. The right picture to have is not “can humans align strong superintelligence”, it’s “can humans align or control AGI-”, then “can {humans and AGI-} align or control AGI” then “can {humans and AGI- and AGI} align AGI+” and so on.

Elsewhere, EY argues that the discontinuity question doesn’t matter, because preventing AI takeover is still a ‘first try or die’ dynamic, so having a gradual ramp-up to superintelligence is of little or no value. I think that’s misguided. Paul Christiano puts it well: “Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter.”

Scott Alexander: "But I think they really do imagine something where a single AI “wakes up” and goes from zero to scary too fast for anyone to notice. I don’t really understand why they think this, I’ve argued with them about it before, and the best I can do as a reviewer is to point to their Sharp Left Turn essay and the associated commentary and and see whether my readers understand it better than I do. "

Clara Collier: "Humanity only gets one shot at the real test." That is, we will have one opportunity to align our superintelligence. That's why we'll fail. It's almost impossible to succeed at a difficult technical challenge when we have no opportunity to learn from our mistakes. But this rests on another implicit claim: Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.

Most people working on this problem today think that AIs will get smarter, but still retain enough fundamental continuity with existing systems that we can do useful work now, while taking on an acceptably low risk of disaster. That's why they bother. Yudkowsky and Soares dismiss these (relative) optimists by stating that "these are not what engineers sound like when they respect the problem, when they know exactly what they're doing. These are what the alchemists of old sounded like when they were proclaiming their grand philosophical principles about how to turn lead into gold." I would argue that the disagreement here has less to do with fundamental respect for the problem than specific empirical beliefs about how AI capabilities will progress and what it will take to control them. If one believes that AI progress will be slow and continuous, or even relatively fast and continuous, it follows that we’ll have more than one shot at the goal."



Discuss
