AI Alignment: Is "Good" Harder to Achieve Than "Smart"?

This post examines the challenge of aligning artificial intelligence (AI) with human values. The author argues that both explicitly encoding human values into an AI (hard-coding) and having the AI infer them on its own (CEV) face enormous difficulties. Hard-coding requires formalizing human values, which is technically extremely hard; inference requires a precise training target, which does not yet exist. The author notes that intelligence and competence appear to be natural attractor states for AI, while human values (such as kindness and humor) do not. As a result, achieving "good" is likely to be harder than achieving "smart" as AI development proceeds, and that difficulty has implications for the direction of the AI industry.

🎯 **Both main routes to AI alignment face challenges:** The article identifies two main ways to align AI with human values: hard-coding the values into the AI, which requires a precise formal specification of human values and is extremely difficult; or having the AI infer human values on its own (CEV), which requires an extremely precise training target that has not yet been found. Both approaches point to major technical obstacles to building a "good" AI.

🧠 **Intelligence is a natural attractor; values are not:** The author argues that gains in intelligence and competence appear to be a natural tendency of AI development, since optimizing for general competence is comparatively easy. Human values such as kindness and humor, however, are not natural attractor states for AI. Even an AI trained on human outputs may end up sharing subgoals like power-seeking rather than core human values.

⚖️ **"Good" may be harder to achieve than "smart":** Based on this analysis, the article argues that if development resources were split between making an AI smarter and making it better aligned, progress on "smart" would likely outpace progress on "good." In other words, making AI "good" is probably harder in practice than making it "smart," a judgment with significant implications for the direction of the AI industry.

Published on October 3, 2025 9:32 PM GMT

This post is part of the sequence Against Muddling Through

In a review of If Anyone Builds It, Everyone Dies, Will MacAskill argued:

> The classic argument for expecting catastrophic misalignment is: (i) that the size of the “space” of aligned goals is small compared to the size of the “space” of misaligned goals (EY uses the language of a “narrow target” on the Hard Fork podcast, and a similar idea appears in the book.); (ii) gaining power and self-preservation is an instrumentally convergent goal, and almost all ultimate goals will result in gaining power and self-preservation as a subgoal.
>
> I don’t think that this is a good argument, because appeals to what “most” goals are like (if you can make sense of that) doesn’t tell you much about what goals are most likely. (Most directions I can fire a gun don’t hit the target; that doesn’t tell you much about how likely I am to hit the target if I’m aiming at it.)

I touched on the "narrow target" idea in my last post. I claim that while compressibility doesn’t tell us what goals are likely to emerge in an AI, it can tell us an awful lot about what goals aren’t, and what kind of sophistication is required to instantiate them on purpose. 

To torture Will’s metaphor, if I have reason to believe the target is somewhere out past the Moon, and Will is aiming a musket at it, I can confidently predict he’ll miss. 
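
To make the compressibility point a little more concrete, here is a toy formalization of my own (the original argument does not depend on these exact symbols): suppose the shortest adequate specification of human values is about $K$ bits, and suppose the goals that emerge from an unaimed process are weighted roughly like a simplicity prior, with a goal of description length $\ell(g)$ receiving weight on the order of $2^{-\ell(g)}$. Then the total weight on goals that actually implement human values is dominated by that shortest description:

$$
\Pr[\text{aligned by chance}] \;\approx\; \sum_{g \in V} 2^{-\ell(g)} \;\approx\; 2^{-K},
$$

where $V$ is the set of goal specifications that implement human values. If $K$ is large, this is astronomically small, and any process that does hit the target must itself be supplying those $K$ bits of aim. Whether real training pipelines resemble such a prior is, of course, one of the assumptions in dispute.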

The way I see it, if we want to align an AI to human values, there are roughly two ways to do it: 

(A) Hard-code the values explicitly into the AI.
(B) Motivate the AI to infer your values on its own (i.e. implement CEV).

Option (A) requires that we be able to formally specify human values. This seems not-technically-impossible but still obscenely hard. Philosophers and mathematicians have been trying this for millennia and have made only modest progress. It probably takes superhuman reflection and internal consistency to do this for one human, let alone all of us. 

Option (B) requires us to aim an AI at something very precisely. We either need a way to formally specify “The CEV of that particular mind-cluster” or (in the machine learning paradigm) we need a training target that will knowably produce this motive. 

Labs do not have such a training target. 

No, not even human behavior. Human behavior is downstream of a chaotic muddle of inconsistent drives. An ASI needs to be motivated to interrogate, untangle, and enact those drives as they exist in real humans, not imitate their surface manifestations. 

I don’t know what this training target would look like.[1] I don’t think it looks like RLHF, or a written “constitution” of rules to follow, or two AIs competing to convince a human judge, or a weak AI that can’t understand human values auditing a strong AI that doesn’t care about them. I don’t think it looks like any scheme I’ve yet seen. 
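
For contrast, here is a minimal sketch of the kind of training target labs do have: an RLHF-style reward model fit to pairwise preference labels. Everything below is illustrative (toy model, synthetic data, made-up shapes), not any lab's actual code; the point is only that the objective sees which output a labeler preferred, and nothing about the drives behind that preference.

```python
# Toy RLHF-style reward model: a Bradley-Terry objective over pairwise preferences.
# Illustrative only -- synthetic embeddings stand in for (prompt, response) pairs.
# Note what the loss *can* see: which of two outputs was preferred. Nothing in it
# references the underlying human drives that produced the preference.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a response embedding with a single scalar reward."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the modeled probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

torch.manual_seed(0)
# Synthetic "labeled" pairs: chosen responses drawn from a slightly shifted distribution.
chosen = torch.randn(256, 64) + 0.5
rejected = torch.randn(256, 64)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final preference loss: {loss.item():.3f}")
```

Whatever reward such a model learns is a function of surface outputs and labeler judgments; aiming it at "the CEV of that particular mind-cluster" would require the preference data itself to carry that information.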

Compounding this problem, intelligence or competence seems like an attractor state. Almost tautologically, processes that are optimized for general competence and problem-solving tend to get better at it. One lesson of modern machine learning seems to be that it’s easier than we thought to optimize an AI for general competence. 

Conversely, I don’t think human values are a natural attractor state for smart things, even smart things trained on human outputs. Some subgoals, like power-seeking, may end up shared, but not things like kindness and humor. 

I’m not sure which of these two rationales would govern — raw algorithmic complexity or attractor states — but they both strongly suggest that “good” is harder than “smart” to achieve in practice. In other words, if someone wanted to build a really competent AI, and they spent about half their resources making it smarter and half making it nicer, I’d expect them to make much faster progress on smart than good. In fact, I’d expect this outcome for most possible distributions of resources. 

In the next post, I’ll explore how this expectation has borne out empirically. AIs have clearly been getting smarter. Have they been getting more aligned? 

Afterword: Some possible cruxes in this post


Some may think we have more options for alignment than (A) or (B). If so, I’d be genuinely interested to hear them, and if any seem both workable and not a subset of (A) or (B), then I’ll add them to the list. 

My guess is that many people think (B) is easier than I do, or have a subtly different idea of what would constitute (B), in a way that’s worth discussing. 

I currently think alignment is likely much harder than any problem humanity has yet solved. I expect some people to disagree with me here. I’m not sure how much. Perhaps some people consider alignment about as hard as the Moon landing; perhaps something easier; perhaps some consider it outright impossible. The degree of difficulty is not quite a crux, because I don’t think the AI industry is currently capable of solving a Moon-landing-level challenge either, but it is certainly load-bearing. If I predicted alignment to be easy, we’d be having a very different conversation. 

Some may agree that goodness is a small target, but disagree that it’s much harder to hit — perhaps thinking that something about machine learning makes progress towards goodness far more efficient than progress towards smartness, or that smart things naturally gravitate towards goodness. I know “human values are a natural attractor state” is a crux for some. 

[1] I can wildly speculate. For instance, you could maybe do something fancy with the uploaded brain of an extremely caring and empathetic person. Nothing like that seems likely to exist in the near future.



