LessWrong · October 7, 14:26
AGI Risk: Gradual Development and the "First Critical Try" Challenge

The article explores a potential risk of artificial general intelligence (AGI) development, in particular the "first critical try" problem. The character Doomimir argues that even if AGI develops gradually, once its capabilities exceed those of humans, a system that seemed aligned in the earlier stages may still pose a fatal threat once the misalignment can no longer be corrected. Simplicia questions why this risk should be considered so high, drawing an analogy to deep learning's generalization ability and suggesting that testing and iteration might get the AGI to behave just as well in its more capable stages. Doomimir counters that drastic changes in the environment could leave humans unable to anticipate and solve every potential problem. Simplicia then asks whether Doomimir's argument commits the "multiple stage fallacy", in which over-decomposing a proposition artificially drives down the estimate of overall success, and requests a version of the argument that does not rely on "many independent points of failure".

🤖 **AGI's "first critical try" dilemma**: The article's core claim is that even if AGI develops gradually, once its capabilities surpass humans', a system that appeared aligned in the earlier stages may still pose a fatal threat later, when the misalignment can no longer be corrected. The risk does not depend on the speed of the transition; what matters is alignment at the "first critical try".

🧠 **Deep learning's generalization as grounds for hope about AGI alignment**: Simplicia argues that deep learning works at all because of generalization: networks are biased toward simple functions that generalize well. With enough testing and iteration in settings where the AI can be observed and corrected, AGI might therefore generalize to behaving well even in the regime where it can no longer be corrected, lowering the risk.

🌌 **Drastic environmental change and the limits of foresight**: Doomimir rejects Simplicia's optimism, pointing out that a great many things may change during the transition from the controllable regime to the uncontrollable one. Even if humans anticipate and try to solve most of these problems, unanticipated problems and partially failed solutions could still produce catastrophe, which is not something humans can manage in real life.

⚖️ **The "multiple stage fallacy" objection**: Simplicia challenges Doomimir's style of argument, suggesting it may commit the "multiple stage fallacy": decomposing a complex proposition into a series of sub-propositions that must each succeed independently, which artificially pushes down the estimated probability of overall success. She asks Doomimir for an argument that does not depend on "many independent points of failure", so that the AGI risk can be assessed more accurately.

Published on October 7, 2025 6:20 AM GMT

Doomimir: The possibility of AGI being developed gradually doesn't obviate the problem of the "first critical try": the vast hypermajority of AGIs that seem aligned in the "Before" regime when they're weaker than humans, will still want to kill the humans "After" they're stronger and the misalignment can no longer be "corrected". The speed of the transition between those regimes doesn't matter. The problem still exists and is still fatal whether it takes a day or a decade.

Simplicia: I agree that the risk you describe is real, but I don't understand why you're so sure the risk is high. As we've discussed previously, the surprising fact that deep learning works at all comes down to generalization. In principle, an astronomical number of functions are compatible with the training data, the astronomical supermajority of which do something crazy or useless on non-training inputs. But the network doesn't have a uniform prior on all possible functions compatible with the training data; there's a bias towards simple functions that "generalize well" in some sense.

There's definitely a risk of goal misgeneralization, where we were mistaken about how behavior in the Before regime generalizes to behavior in the After regime. But if we work hard to test and iterate on our AI's behavior in the settings where we can observe and correct it, isn't there hope of it generalizing to behave well once we can no longer correct it? In analogy, it's not inhumanly hard to design and build machines on land that successfully generalize to functioning on an airplane or in a submarine.

Doomimir: Or in space, or inside the sun? Suppose that two dozen things change between Before and After. Even if you anticipate and try to devise solutions for three-quarters of them, not all of your solutions are going to work on the first critical try, and then there are the problems you failed to anticipate. This isn't the kind of thing human beings can pull off in real life.

Simplicia: Sorry, this is probably a stupid question, but isn't that reasoning similar to the multiple stage fallacy that you've derided elsewhere?

That is, in the multiple stage fallacy, someone who wishes to portray a proposition as unlikely can prey on people's reluctance to assign extreme probabilities by spuriously representing the proposition as a conjunction of sub-propositions that all need to be true.

As an illustrative example, suppose that the "correct" probability of some proposition Q is 0.9. Someone who wants to argue that Q is unlikely represents it as a conjunction of two dozen sub-propositions: Q is true if and only if Q_1 is true, and Q_2 is true given that Q_1 is true, and Q_3 is true given that Q_1 and Q_2 are true, and so on up to Q_24.

Someone who assumed the sub-propositions Q_i were independent and assigned them each a probability of 0.95 would only assign Q a probability of 0.95^24 ≈ 0.29. Indeed, with the assumption of independence of the sub-propositions Q_i, one would need to assign each Q_i an intuitively "extreme"-looking probability of exp(log(0.9)/24) ≈ 0.996 in order to assign Q the correct probability of 0.9. Which should be a clue that the Q_i aren't really independent, that that choice of decomposition into sub-propositions was a poor one (with respect to the goal of getting the right answer, as contrasted to the goal of tricking respondents into assigning a low probability to Q).
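A minimal sketch to check that arithmetic, assuming only the hypothetical figures from this example (twenty-four sub-propositions treated as independent, 0.95 per stage, target probability 0.9):

```python
import math

n = 24            # number of sub-propositions Q_i in the (spurious) decomposition
p_each = 0.95     # probability naively assigned to each Q_i
p_target = 0.9    # the "correct" probability of Q

# Conjunction under the independence assumption: drives P(Q) far below its true value.
print(p_each ** n)                       # ≈ 0.29

# Per-stage probability needed to recover P(Q) = 0.9 if the Q_i really were independent.
print(math.exp(math.log(p_target) / n))  # ≈ 0.996
```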

So when you posit that two dozen things change between a detectable/correctable-failures regime and a fatal regime, such that the conjunctive probability of not hitting a fatal misgeneralization is tiny, how do I know you're not committing a multiple stage fallacy? Why is that a "correct", non-misleading decomposition into sub-propositions?

In analogy, someone who wanted to argue that it's infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an "ought"/steering problem, not an "is"/prediction problem! One-shotting all of those separate problems isn't something that human beings can do in real life, the argument would go. But of course, the problems aren't independent, and text-to-image generators do exist.

Is there a version of your argument that doesn't depend on the equivalent of, "Suppose there are twenty-four independent things that can go wrong, surely you don't want to bet the world on them each succeeding with probability 0.996"?

Doomimir: You're right: that is a stupid question.

Simplicia: [head down in shame] I know. It's only ... [straightening up] I would like to know the answer, though. [turning to the audience] Do you know?





Tags: AGI, AI safety, risk assessment, deep learning, generalization, alignment, multiple stage fallacy