AI Alignment: Is "Good" Harder to Achieve Than "Smart"?

This post examines the challenge of aligning artificial intelligence (AI) with human values. The author argues that both explicitly encoding human values into an AI (hard-coding) and having the AI infer them on its own (CEV) face enormous difficulties. Hard-coding requires formalizing human values, which is technically extremely hard; inference requires a precise training target, which does not yet exist. The author notes that intelligence and competence appear to be natural attractor states for AI, while human values (such as kindness and humor) do not. As a result, achieving "good" is likely to be harder than achieving "smart" as AI development proceeds, and that difficulty has implications for the direction of the AI industry.

🎯 **Both main routes to AI alignment face challenges:** The article identifies two main ways to align AI with human values: hard-coding the values into the AI, which requires a precise formal specification of human values and is extremely difficult; or having the AI infer human values on its own (CEV), which requires an extremely precise training target that has not yet been found. Both approaches point to major technical obstacles to building a "good" AI.

🧠 **Intelligence is a natural attractor; values are not:** The author argues that gains in intelligence and competence appear to be a natural tendency of AI development, since optimizing for general competence is comparatively easy. Human values such as kindness and humor, however, are not natural attractor states for AI. Even an AI trained on human outputs may end up sharing subgoals like power-seeking rather than core human values.

⚖️ **"Good" may be harder to achieve than "smart":** Based on this analysis, the article argues that if development resources were split between making an AI smarter and making it better aligned, progress on "smart" would likely outpace progress on "good." In other words, making AI "good" is probably harder in practice than making it "smart," a judgment with significant implications for the direction of the AI industry.

Published on October 3, 2025 9:32 PM GMT

This post is part of the sequence Against Muddling Through

In a review of If Anyone Builds It, Everyone Dies, Will MacAskill argued:

> The classic argument for expecting catastrophic misalignment is: (i) that the size of the “space” of aligned goals is small compared to the size of the “space” of misaligned goals (EY uses the language of a “narrow target” on the Hard Fork podcast, and a similar idea appears in the book.); (ii) gaining power and self-preservation is an instrumentally convergent goal, and almost all ultimate goals will result in gaining power and self-preservation as a subgoal.
>
> I don’t think that this is a good argument, because appeals to what “most” goals are like (if you can make sense of that) doesn’t tell you much about what goals are most likely. (Most directions I can fire a gun don’t hit the target; that doesn’t tell you much about how likely I am to hit the target if I’m aiming at it.)

I touched on the "narrow target" idea in my last post. I claim that while compressibility doesn’t tell us what goals are likely to emerge in an AI, it can tell us an awful lot about what goals aren’t, and what kind of sophistication is required to instantiate them on purpose. 

To torture Will’s metaphor, if I have reason to believe the target is somewhere out past the Moon, and Will is aiming a musket at it, I can confidently predict he’ll miss. 
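
To make the compressibility point a little more concrete, here is a toy formalization of my own (the original argument does not depend on these exact symbols): suppose the shortest adequate specification of human values is about $K$ bits, and suppose the goals that emerge from an unaimed process are weighted roughly like a simplicity prior, with a goal of description length $\ell(g)$ receiving weight on the order of $2^{-\ell(g)}$. Then the total weight on goals that actually implement human values is dominated by that shortest description:

$$
\Pr[\text{aligned by chance}] \;\approx\; \sum_{g \in V} 2^{-\ell(g)} \;\approx\; 2^{-K},
$$

where $V$ is the set of goal specifications that implement human values. If $K$ is large, this is astronomically small, and any process that does hit the target must itself be supplying those $K$ bits of aim. Whether real training pipelines resemble such a prior is, of course, one of the assumptions in dispute.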

The way I see it, if we want to align an AI to human values, there are roughly two ways to do it: 

(A) Hard-code the values explicitly into the AI.
(B) Motivate the AI to infer your values on its own (i.e. implement CEV).

Option (A) requires that we be able to formally specify human values. This seems not-technically-impossible but still obscenely hard. Philosophers and mathematicians have been trying this for millennia and have made only modest progress. It probably takes superhuman reflection and internal consistency to do this for one human, let alone all of us. 

Option (B) requires us to aim an AI at something very precisely. We either need a way to formally specify “The CEV of that particular mind-cluster” or (in the machine learning paradigm) we need a training target that will knowably produce this motive. 

Labs do not have such a training target. 

No, not even human behavior. Human behavior is downstream of a chaotic muddle of inconsistent drives. An ASI needs to be motivated to interrogate, untangle, and enact those drives as they exist in real humans, not imitate their surface manifestations. 

I don’t know what this training target would look like.[1] I don’t think it looks like RLHF, or a written “constitution” of rules to follow, or two AIs competing to convince a human judge, or a weak AI that can’t understand human values auditing a strong AI that doesn’t care about them. I don’t think it looks like any scheme I’ve yet seen. 
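
For contrast, here is a minimal sketch of the kind of training target labs do have: an RLHF-style reward model fit to pairwise preference labels. Everything below is illustrative (toy model, synthetic data, made-up shapes), not any lab's actual code; the point is only that the objective sees which output a labeler preferred, and nothing about the drives behind that preference.

```python
# Toy RLHF-style reward model: a Bradley-Terry objective over pairwise preferences.
# Illustrative only -- synthetic embeddings stand in for (prompt, response) pairs.
# Note what the loss *can* see: which of two outputs was preferred. Nothing in it
# references the underlying human drives that produced the preference.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a response embedding with a single scalar reward."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the modeled probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

torch.manual_seed(0)
# Synthetic "labeled" pairs: chosen responses drawn from a slightly shifted distribution.
chosen = torch.randn(256, 64) + 0.5
rejected = torch.randn(256, 64)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final preference loss: {loss.item():.3f}")
```

Whatever reward such a model learns is a function of surface outputs and labeler judgments; aiming it at "the CEV of that particular mind-cluster" would require the preference data itself to carry that information.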

Compounding this problem, intelligence or competence seems like an attractor state. Almost tautologically, processes that are optimized for general competence and problem-solving tend to get better at it. One lesson of modern machine learning seems to be that it’s easier than we thought to optimize an AI for general competence. 

Conversely, I don’t think human values are a natural attractor state for smart things, even smart things trained on human outputs. Some subgoals, like power-seeking, may end up shared, but not things like kindness and humor. 

I’m not sure which of these two rationales would govern — raw algorithmic complexity or attractor states — but they both strongly suggest that “good” is harder than “smart” to achieve in practice. In other words, if someone wanted to build a really competent AI, and they spent about half their resources making it smarter and half making it nicer, I’d expect them to make much faster progress on smart than good. In fact, I’d expect this outcome for most possible distributions of resources. 

In the next post, I’ll explore how this expectation has borne out empirically. AIs have clearly been getting smarter. Have they been getting more aligned? 

Afterword: Some possible cruxes in this post


Some may think we have more options for alignment than (A) or (B). If so, I’d be genuinely interested to hear them, and if any seem both workable and not a subset of (A) or (B), then I’ll add them to the list. 

My guess is that many people think (B) is easier than I do, or have a subtly different idea of what would constitute (B), in a way that’s worth discussing. 

I currently think alignment is likely much harder than any problem humanity has yet solved. I expect some people to disagree with me here. I’m not sure how much. Perhaps some people consider alignment about as hard as the Moon landing; perhaps something easier; perhaps some consider it outright impossible. The degree of difficulty is not quite a crux, because I don’t think the AI industry is currently capable of solving a Moon-landing-level challenge either, but it is certainly load-bearing. If I predicted alignment to be easy, we’d be having a very different conversation. 

Some may agree that goodness is a small target, but disagree that it’s much harder to hit — perhaps thinking that something about machine learning makes progress towards goodness far more efficient than progress towards smartness, or that smart things naturally gravitate towards goodness. I know “human values are a natural attractor state” is a crux for some. 

[1] I can wildly speculate. For instance, you could maybe do something fancy with the uploaded brain of an extremely caring and empathetic person. Nothing like that seems likely to exist in the near future.



