LessWrong · October 7, 14:26
AGI Risk: Gradual Development and the "First Critical Try" Challenge

The article explores a potential risk of artificial general intelligence (AGI) development, in particular the "first critical try" problem. The character Doomimir argues that even if AGI develops gradually, once its capabilities exceed those of humans, a system that seemed aligned in the earlier stages may still pose a fatal threat once the misalignment can no longer be corrected. Simplicia questions why this risk should be considered so high, drawing an analogy to deep learning's generalization ability and suggesting that testing and iteration might get the AGI to behave just as well in its more capable stages. Doomimir counters that drastic changes in the environment could leave humans unable to anticipate and solve every potential problem. Simplicia then asks whether Doomimir's argument commits the "multiple stage fallacy", in which over-decomposing a proposition artificially drives down the estimate of overall success, and requests a version of the argument that does not rely on "many independent points of failure".

🤖 **AGI's "first critical try" dilemma**: The article's core claim is that even if AGI develops gradually, once its capabilities surpass humans', a system that appeared aligned in the earlier stages may still pose a fatal threat later, when the misalignment can no longer be corrected. The risk does not depend on the speed of the transition; what matters is alignment at the "first critical try".

🧠 **Deep learning's generalization as grounds for hope about AGI alignment**: Simplicia argues that deep learning works at all because of generalization: networks are biased toward simple functions that generalize well. With enough testing and iteration in settings where the AI can be observed and corrected, AGI might therefore generalize to behaving well even in the regime where it can no longer be corrected, lowering the risk.

🌌 **Drastic environmental change and the limits of foresight**: Doomimir rejects Simplicia's optimism, pointing out that a great many things may change during the transition from the controllable regime to the uncontrollable one. Even if humans anticipate and try to solve most of these problems, unanticipated problems and partially failed solutions could still produce catastrophe, which is not something humans can manage in real life.

⚖️ **The "multiple stage fallacy" objection**: Simplicia challenges Doomimir's style of argument, suggesting it may commit the "multiple stage fallacy": decomposing a complex proposition into a series of sub-propositions that must each succeed independently, which artificially pushes down the estimated probability of overall success. She asks Doomimir for an argument that does not depend on "many independent points of failure", so that the AGI risk can be assessed more accurately.

Published on October 7, 2025 6:20 AM GMT

Doomimir: The possibility of AGI being developed gradually doesn't obviate the problem of the "first critical try": the vast hypermajority of AGIs that seem aligned in the "Before" regime when they're weaker than humans, will still want to kill the humans "After" they're stronger and the misalignment can no longer be "corrected". The speed of the transition between those regimes doesn't matter. The problem still exists and is still fatal whether it takes a day or a decade.

Simplicia: I agree that the risk you describe is real, but I don't understand why you're so sure the risk is high. As we've discussed previously, the surprising fact that deep learning works at all comes down to generalization. In principle, an astronomical number of functions are compatible with the training data, the astronomical supermajority of which do something crazy or useless on non-training inputs. But the network doesn't have a uniform prior on all possible functions compatible with the training data; there's a bias towards simple functions that "generalize well" in some sense.

There's definitely a risk of goal misgeneralization, where we were mistaken about how behavior in the Before regime generalizes to behavior in the After regime. But if we work hard to test and iterate on our AI's behavior in the settings where we can observe and correct it, isn't there hope of it generalizing to behave well once we can no longer correct it? In analogy, it's not inhumanly hard to design and build machines on land that successfully generalize to functioning on an airplane or in a submarine.

Doomimir: Or in space, or inside the sun? Suppose that two dozen things change between Before and After. Even if you anticipate and try to devise solutions for three-quarters of them, not all of your solutions are going to work on the first critical try, and then there are the problems you failed to anticipate. This isn't the kind of thing human beings can pull off in real life.

Simplicia: Sorry, this is probably a stupid question, but isn't that reasoning similar to the multiple stage fallacy that you've derided elsewhere?

That is, in the multiple stage fallacy, someone who wishes to portray a proposition as unlikely can prey on people's reluctance to assign extreme probabilities by spuriously representing the proposition as a conjunction of sub-propositions that all need to be true.

As an illustrative example, suppose that the "correct" probability of some proposition Q is 0.9. Someone who wants to argue that Q is unlikely represents it as a conjunction of two dozen sub-propositions: Q is true if and only if Q_1 is true, and Q_2 is true given that Q_1 is true, and Q_3 is true given that Q_1 and Q_2 are true, and so on up to Q_24.

Someone who assumed the sub-propositions Q_i were independent and assigned them each a probability of 0.95 would only assign Q a probability of 0.95^24 ≈ 0.29. Indeed, with the assumption of independence of the sub-propositions Q_i, one would need to assign each Q_i an intuitively "extreme"-looking probability of exp(log(0.9)/24) ≈ 0.996 in order to assign Q the correct probability of 0.9. Which should be a clue that the Q_i aren't really independent, that that choice of decomposition into sub-propositions was a poor one (with respect to the goal of getting the right answer, as contrasted to the goal of tricking respondents into assigning a low probability to Q).
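A minimal sketch to check that arithmetic, assuming only the hypothetical figures from this example (twenty-four sub-propositions treated as independent, 0.95 per stage, target probability 0.9):

```python
import math

n = 24            # number of sub-propositions Q_i in the (spurious) decomposition
p_each = 0.95     # probability naively assigned to each Q_i
p_target = 0.9    # the "correct" probability of Q

# Conjunction under the independence assumption: drives P(Q) far below its true value.
print(p_each ** n)                       # ≈ 0.29

# Per-stage probability needed to recover P(Q) = 0.9 if the Q_i really were independent.
print(math.exp(math.log(p_target) / n))  # ≈ 0.996
```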

So when you posit that two dozen things change between a detectable/correctable-failures regime and a fatal regime, such that the conjunctive probability of not hitting a fatal misgeneralization is tiny, how do I know you're not committing a multiple stage fallacy? Why is that a "correct", non-misleading decomposition into sub-propositions?

In analogy, someone who wanted to argue that it's infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an "ought"/steering problem, not an "is"/prediction problem! One-shotting all of those separate problems isn't something that human beings can do in real life, the argument would go. But of course, the problems aren't independent, and text-to-image generators do exist.

Is there a version of your argument that doesn't depend on the equivalent of, "Suppose there are twenty-four independent things that can go wrong, surely you don't want to bet the world on them each succeeding with probability 0.996"?

Doomimir: You're right: that is a stupid question.

Simplicia: [head down in shame] I know. It's only ... [straightening up] I would like to know the answer, though. [turning to the audience] Do you know?





Tags: AGI, AI safety, risk assessment, deep learning, generalization, alignment, multiple stage fallacy