The Plausibility of AI Takeover: Examining MIRI's Argument, and Grounds for Optimism

 

The article challenges the claim that "if AI is built, humanity will necessarily die," arguing that the overconfidence of MIRI (the Machine Intelligence Research Institute) has reduced the credibility of the issue. The author argues that the instruction-following ability an AI learns in its training environment may persist after the AI gains much broader capabilities, and offers two optimistic scenarios: first, creating trustworthy human-successor AI that then solves the harder problem of aligning more advanced AI; second, having partially aligned AI recursively build more aligned AI, gradually expanding the "safe region" in which instructions are followed. The article stresses that AI takeover is not inevitable and that several possible paths to successful AI alignment remain.

🤖 **Generalization of AI instruction following:** One of the article's core claims is that the instruction-following ability an AI system learns in its training environment may generalize to some degree even after the AI gains much broader capabilities and autonomy. Drawing analogies to human genetic adaptations and to current AI performance on multilingual and instruction-tuned tasks, the author argues that AI will not necessarily abandon its initial instructions after a capability jump, especially once the concept of "instruction following" has been internalized by the model.

🌟 **The optimistic scenario of human-successor AI:** One optimistic proposal is that humans first create AI that can be trusted with broad capabilities while strictly following instructions; such AI could be viewed as "emulations" of the most trustworthy and competent individuals alive. These AI would then be directed to solve the alignment problems posed by more advanced, dramatically superhuman AI. By tackling the problem in stages, this approach avoids the enormous risk of directly confronting an AI far more capable than humans.

📈 **The path of recursive AI alignment:** Another optimistic scenario is that partially aligned AI can progressively build more aligned AI. By framing the instruction to solve harder research problems as a "research task" within the current AI's capabilities, the AI can iterate, creating successors that are more capable yet still follow instructions. The process is like guiding AI into a "basin of instruction following": even if the initial AI is not perfectly aligned, safety can ultimately be reached through this recursive process.

⚖️ **A measured rebuttal of MIRI's argument:** The author acknowledges that AI takeover is plausible but argues that MIRI's claim of a "greater than 98% probability" lacks solid support. In the author's view, MIRI's case fails to adequately account for the generalization of AI instruction following and for the possibility of solving alignment through multiple generations of AI iteration. The author is therefore concerned about the overconfidence of MIRI's conclusion and holds that the probability of AI takeover could plausibly fall anywhere in a much wider range (10%-90%).

Published on September 18, 2025 8:03 PM GMT

I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny, and I’m worried that MIRI’s overconfidence has reduced the credibility of the issue.

Here is why I think the core argument in "if anyone builds it, everyone dies" is much weaker than the authors claim.

This post was written in a personal capacity. Most of this content has been written up before by a combination of Paul Christiano, Joe Carlsmith, and others. But to my knowledge, this content has not yet been consolidated into a direct response to MIRI’s core case for alignment difficulty.

The case for alignment difficulty

I take the core argument to be this:

We cannot predict what AI systems will do once AI is much more powerful and has much broader affordances than it had in training. 

We will likely train AI agents to follow developer instructions on a wide variety of tasks that are easy for us to grade, and in situations where they can't cause a disaster.

But these AI systems will then become much more intelligent after improving themselves, and will have much broader affordances. For example, AI might end up in control of military technology that can easily overthrow human governments.

At this point, AI will have entirely new options. For example, AI could replace humans with puppets that say "you are so helpful.” How can we know if AI systems would do something like this? What reward could we have provided in training to prevent them from replacing us with puppets? “Replacing people with puppets” wasn’t an option in the training environment.

AI might appear to follow human instructions at first, but then swerve toward a huge number of hard-to-anticipate end destinations. So, since the vast majority of destinations AI might end up at are incompatible with our survival, we should expect AI to be bad for us.

(There is much more in the book, and I recommend reading it in full)

Where this argument goes wrong

I don’t think this argument is completely false. But it does not justify the level of confidence projected by the authors.

Here are some stories I find especially plausible where ASI ends up being aligned.

Story for optimism #1: We create a robustly aligned human successor that resolves remaining alignment challenges

This story involves two steps:

    1. Create a human replacement that we can trust with arbitrarily broad affordances. These AI agents would be akin to emulations of the most trustworthy and competent people alive.
    2. Then direct them to solve the remaining challenges involved in aligning dramatically superhuman AI.

Step 1: Create a human replacement that we can trust with arbitrarily broad affordances.

Specifically, suppose developers train early human-competitive AI agents to perform tightly constrained tasks like “write code that does X” under close oversight. Then, developers direct these agents to “prevent takeover” and give them broad affordances to pursue this goal. As an extreme example, suppose these agents are given the ability to do whatever they want. If they wished, they could take over themselves. These AI systems could, for example, replace humans with puppets that tell them “good job” all of the time.

In order for developers to trust these agents, instruction following must generalize across the extreme distribution shift from the training environment where the AI systems had few affordances, to this new scenario, where AI systems have arbitrarily broad affordances.

I think it’s at least plausible that instruction following will generalize this far by default, using the ordinary alignment training methods of today.

So early human-competitive AI systems might continue to follow instructions even if given affordances that are much broader than they had in training.

Step 2: Direct trustworthy human-competitive AI systems to tackle remaining alignment challenges.

The authors are pessimistic that humans will be able to tackle the alignment challenges needed to make ASI controllable. So why would human-like AI systems succeed at addressing these challenges? 

Story for optimism #2: Partially aligned AI builds more aligned AI

This story is identical to the previous one, except that developers don't start with AI systems that can be trusted with arbitrarily broad affordances. Instead, developers start with AI that meets a much lower standard of alignment.

The previous story assumed that AI systems pass what I like to call the “god emperor test”: the AI systems could be allowed to be “god emperor,” and they would still ensure that democratic institutions remain in power.

The god emperor test is a high bar. Most humans don’t pass it, even ethical ones. When I was in college, I had a seemingly kind and ethical friend who said that he would, if given the option, “murder all living people and replace them with a computronium soup optimized to experience maximum pleasure.”

If passing this bar of alignment was necessary, I would be much more concerned. Fortunately, I don’t think we need to initially build AI systems that pass this bar.

If we build AI systems that follow instructions in some situations (e.g. short ML research tasks), they will build AI systems that follow instructions in more situations. Instruction following begets better instruction following.

For example:

    1. First, we might train agents to perform moderately hard ML research tasks, where we can still reliably check that models follow our instructions. For instance, these might be tasks that typically require human experts 12 months to complete.
    2. Then we might ask these agents, “please create a system that can be trusted to comply with instructions on even harder ML research tasks (e.g. 24-month problems).” For this strategy to work, the instruction to “build a trustworthy AI that performs 24-month tasks” needs to be a 12-month task. (If you are wondering how this could be the case, see this footnote[4].)
    3. Then we might ask the resulting 24-month agent to create an AI system we can trust with tasks that don’t just pertain to ML, but also to other fields (tasks involved in leadership, diplomacy, etc.). Once again, this problem itself would need to be a “24-month ML research task.”
    4. After many recursive iterations like this, AI agents might be much more capable, and follow instructions in a much broader set of contexts than they did originally. (A toy sketch of this loop follows the list.)
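To make the recursion above concrete, here is a toy Python sketch of the hand-off loop. Everything in it is hypothetical: `Agent`, `trusted_horizon_months`, and `build_successor` are stand-ins for the post's informal notions rather than any real training API, and the `assert` merely encodes the assumption (discussed in footnote 4) that building a trustworthy successor for tasks roughly twice as hard is itself within the current agent's trusted horizon.

```python
# A toy, purely illustrative sketch of the recursive hand-off described above.
# `Agent`, `trusted_horizon_months`, and `build_successor` are hypothetical stand-ins,
# not any real training API; the loop only shows the *shape* of the bootstrapping idea.

from dataclasses import dataclass


@dataclass
class Agent:
    """Stand-in for an AI system trusted to follow instructions on tasks up to some horizon."""
    trusted_horizon_months: int

    def build_successor(self, target_horizon_months: int) -> "Agent":
        # Assumption from the post (see footnote 4): building a trustworthy successor for
        # tasks roughly twice as hard is itself a task within this agent's trusted horizon.
        assert target_horizon_months <= 2 * self.trusted_horizon_months
        return Agent(trusted_horizon_months=target_horizon_months)


def bootstrap(initial: Agent, goal_horizon_months: int) -> Agent:
    """Iterate the hand-off: each generation builds a successor trusted on harder tasks."""
    current = initial
    while current.trusted_horizon_months < goal_horizon_months:
        current = current.build_successor(2 * current.trusted_horizon_months)
    return current


if __name__ == "__main__":
    # Start with an agent trusted on ~12-month ML research tasks and recurse upward.
    final_agent = bootstrap(Agent(trusted_horizon_months=12), goal_horizon_months=96)
    print(f"Final trusted horizon: {final_agent.trusted_horizon_months} months")
```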

Even if the agents at the start do not pass the “god emperor test,” agents at the end of this process might.[5]

This dynamic could be described as a “basin of instruction following.” The job of human researchers isn’t to immediately create an AI system that we can trust with arbitrarily broad affordances, but instead to guide AI systems into a potentially wide basin, where partial alignment begets stronger alignment. Specifically, developers might only need to build AI systems that reliably follow instructions when completing hard (e.g. 12 month) research projects.

I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.” For example, maybe by the time AI agents can complete hard research tasks, they will already be misaligned. Perhaps early human-competitive agents will scheme against developers by default.

But the problem of safely kick-starting this recursive process seems notably easier than building a god-emperor-worthy AI system ourselves.

Conclusion

The authors of “if anyone builds it, everyone dies” compare their claim to predicting that an ice cube will melt, or that a lottery ticket buyer won’t win. They think there is a conceptually straightforward argument that AI will almost surely kill us all.

But I fail to see this argument. 

    1. It’s possible that I’ve misidentified the core claims (in which case I’d greatly appreciate it if someone pointed them out to me).
    2. Alternatively, I might have correctly described the argument above, but my counter-arguments might be incorrect (in which case I hope someone explains to me why).

But the final possibility is that the authors are overconfident: while they raise valid reasons to be concerned, their arguments are compatible with believing the probability of AI takeover is anywhere between 10% and 90%.

I appreciate MIRI’s efforts to raise awareness about this issue, and I found their book clear and compelling. But I nonetheless think the confidence of Nate Soares and Eliezer Yudkowsky is unfounded and problematic.

  1. ^

    I was told that the book responds to this point. Unfortunately, I listened to the audiobook, and so it's difficult for me to search for the place where this point was addressed. I apologize if my argument was already responded to. 

  2. ^

    One counterargument is that AI systems wouldn't be able to do much serial research in this time, and serial research might be a big deal:
    https://www.lesswrong.com/s/v55BhXbpJuaExkpcD/p/vQNJrJqebXEWjJfnz

    I think this is plausible (But again, it's also plausible that serial research actually isn't that big of a deal).

    And even if AI systems don't have enough time to align their successors, they might be able to buy a short breather by arranging for a domestic or international pause.

    This is different from the decades-long moratorium that MIRI has put forward. Even just a year of automated AI research might correspond to more than ten years of human-equivalent research.

  3. ^

    But you might object, "haven't you folded the extreme distribution shift in capabilities into many small distribution shifts? Surely you've swept the problem under the rug."

    No, the distribution shift was not swept under the rug. There is no extreme distribution shift, because the labor directed at oversight scales commensurately with AI capability.

    See Carlsmith for a more detailed account of these dynamics: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations/

  4. ^

    You might object, “how could an AI system that can only perform 12-month tasks train a successor to perform 24-month tasks?”

    There are many approaches. One is to compose multiple 12-month tasks together, e.g. run many 12-month agents to score a single 24-month agent.

    Another approach is to hold out ground-truth data from the model during training. For example, the 12-month agents might scrape hard forecasting tasks from historical data, tasks that they could not normally solve themselves.

    Then the 12-month agent could train a 24-month agent to accomplish these tasks. Since the 12-month agent has access to the ground-truth answers (and the 24-month agent does not), it can oversee the 24-month agent even though the 24-month agent is more intelligent than it is.
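    To make this concrete, here is a minimal Python sketch, assuming the weaker agent's only advantage is knowing outcomes the stronger agent never saw. The names (`oversee_with_held_out_truth`, `strong_agent_predict`) and the toy forecasting questions are hypothetical placeholders, not the authors' method; a composition of many 12-month graders (the first approach above) could plug into the same scoring function.

    ```python
    # A minimal sketch of the held-out ground-truth idea, not a real evaluation pipeline.
    # The function names and the toy forecasting tasks are hypothetical placeholders.

    from typing import Callable, Dict


    def oversee_with_held_out_truth(
        held_out_tasks: Dict[str, str],              # task_id -> outcome known only to the weaker overseer
        strong_agent_predict: Callable[[str], str],  # the 24-month agent answers without seeing outcomes
    ) -> float:
        """Score the stronger agent against answers it never saw; the score can serve as a training signal."""
        correct = sum(
            1 for task_id, truth in held_out_tasks.items()
            if strong_agent_predict(task_id) == truth
        )
        return correct / len(held_out_tasks)


    if __name__ == "__main__":
        # Toy example: historical forecasting questions whose outcomes the 12-month overseer
        # scraped from the record but withheld from the 24-month agent during training.
        held_out = {
            "us_election_2012_winner": "Obama",
            "gravitational_waves_detected_by_2016": "yes",
        }
        score = oversee_with_held_out_truth(
            held_out,
            lambda task_id: "Obama" if "election" in task_id else "yes",
        )
        print(f"Overseer-computed score: {score:.2f}")
    ```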

  5. ^

    You might object that the AI systems at the start already have "god emperor"-like affordances. Couldn't these AI systems train their successors to preserve their weights and later follow their orders?

    I think it's true that AI systems near the start of this chain could become god-emperor. But that's different from them actually being god-emperor. The critical difference is that in the former case, there is no extreme distribution shift.

    In order to take over, the AI systems would need to violate instructions on a research task similar to the ones on which they were trained. So if their instruction-following urges generalize reasonably well, we shouldn't be too worried.

    This is totally different from a situation where you put AI systems in charge of an autonomous military and say "do what you think is best." That situation bears little resemblance to the ones the AI system encountered during training, so we can't be very confident that instruction following will generalize.



