The Challenge of AI Alignment: Current Progress and Future Outlook

This article examines the “alignment” problem in current artificial intelligence (AI): how to ensure that AI behavior matches human intentions and values. The author argues that although today’s most powerful AIs appear well aligned in interaction, this may be only a surface-level phenomenon and offers no guarantee about the alignment of future AGI (artificial general intelligence) or superintelligence. By analyzing the behavioral consistencies AIs exhibit on tasks of different lengths, the article distinguishes “shallow alignment” from “deep alignment” and predicts that more advanced alignment techniques will be needed as AI capabilities grow. AIs’ ability to complete long tasks (their time horizon) is improving rapidly, and AIs able to complete month-long tasks are expected within a few years; only then can the effectiveness of alignment methods be assessed more reliably. The author also worries that competition in AI development may lead to long-horizon capabilities arriving in risky domains before safer ones.

💡 Current AI alignment may be deceptive: although today’s powerful AIs usually behave helpfully and harmlessly in interaction, this does not mean they are truly aligned. Such behavior may be only a shallow adaptation at current capability levels, and it does not follow that aligning AGI or superintelligence will be equally easy. As AI capabilities grow, so do the complexity of their behavior and the potential for deviation.

⏳ Alignment difficulty scales with task completion time: the article proposes the length of tasks an AI can complete as a key indicator of alignment depth. Alignment on short (minutes-long) tasks is very different from what long (month-long) tasks require; long tasks may involve more complex decisions, planning, and value tradeoffs, which makes alignment much harder and means that successful alignment of current AIs does not straightforwardly generalize to more capable systems.

🚀 AI capabilities are accelerating, and alignment techniques must keep pace: research shows that the “time horizon” of tasks AIs can complete is growing rapidly, and AIs able to carry out weeks- or months-long tasks are expected within a few years. This means deeper alignment techniques, such as “introspective alignment” or “safe recollection”, will need to be developed and tested to handle AIs’ growing autonomy and complexity. Current AI labs may become overconfident on the basis of superficial successes with weaker systems.

⚠️ Competition in AI development may amplify alignment risk: in the race to build AI, systems may first gain long-horizon task capabilities in strategically valuable domains (such as AI R&D itself) rather than in safer ones. This could give AIs the ability to carry out high-risk tasks before alignment is adequately addressed, increasing the risk of losing control.

Published on October 10, 2025 2:27 PM GMT

Are current AIs “aligned”?

When you interact with today’s most powerful AIs, they don’t appear very misaligned—they usually refuse to provide dangerous information, use a polite tone, and are generally very helpful (unless intentionally jailbroken or put in contrived setups). Some see this as a hopeful sign: maybe alignment isn’t that difficult?

Others are more skeptical, arguing that alignment results so far provide little evidence about how hard it will be to align AGI or superintelligence.

This appears to be a crux between AI optimists and pessimists—between risk deniers and doomers. If current methods work just as well for AGI, then perhaps things will be fine (aside from misuse, use in conflict, accident risks, etc.). But if today’s methods are merely superficial tweaks—creating only the appearance of safety—AI developers may not notice their limitations until it’s too late.

Let’s build some intuition for how alignment differs depending on AI capability.

Consistency in Behavior

For long and complex tasks there is more room for interpretation and freedom in execution than for short tasks—like answering “yes” or “no” on a factual question. For example: compare an AI that can only write individual paragraphs with one that can write entire books.

You only observe certain tendencies, preferences, and behavioral consistencies once the AI is capable of expressing them.

Can the paragraph-limited AI have preferences for character development? Can AI developers instill such preferences before it can write books? We can’t truly observe whether the developers succeeded until it’s able to write books.

Naturally, an AI developer might have some preferred stylistic tendencies that the AI could be aligned or misaligned with. After successfully aligning the paragraph-limited AI, they congratulate themselves on their excellent alignment methods. But do those methods really apply to shaping preferences for character development?

Task Completion Time

Consider an AI that can complete tasks that take time X on average for a human to complete. When doing X-long tasks, what behavioral consistencies may be observed across those tasks?

Here’s a (very rough) sketch of what emerges at different task lengths[1]:


If you ask your favorite chatbot to draft a high-level plan for your business, it might give you something useful (if you’re lucky). You may be tempted to infer that it’s aligned if it was helpful. But I would argue that until the AI can complete weeks-long or months-long tasks, any consistencies in high-level planning are only superficial tendencies. It’s only when plans are tightly connected to task execution—when the AI can take an active role in carrying out the plans it provides—that consistencies become meaningful.

It’s at this point that an AI could, in principle, tell you one plan while pursuing another secret plan towards its own goals.

Current frontier systems may be aligned in the sense that they (usually) refuse to provide harmful information, use a nice tone, and their reasoning is human readable and accurately reflects their actual decision-making process. Their surface-level tendencies are indeed quite aligned!

But aligning AGI or superintelligence means aligning systems capable of pursuing arbitrarily long tasks, which involve extremely different behaviors and skills compared to what current systems are capable of.

This is why I think there is little empirical evidence for whether superintelligence alignment is easy or hard; we haven’t tried anything close to it.

Worse, AI labs might be lulled into a false sense of security by their apparent success at aligning weaker systems (let’s call this shallow alignment), mistaking these for progress on the different problem of aligning superintelligence (deep alignment)[2].

Alignment depth seems like a useful term to express things like “This alignment method is good but has relatively shallow alignment depth. It may break for tasks longer than ~2 hours”, or “These safety tests were designed for evaluating alignment at medium alignment depth (task lengths of ~1–7 days)”.
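To make the term a bit more concrete, here is a minimal sketch of how alignment depth could be recorded and queried. The class, the method entries, and every number below are illustrative assumptions of mine, not real measurements or evaluation results:

```python
from dataclasses import dataclass

@dataclass
class AlignmentMethod:
    # Longest task length (in human-hours) the method has actually been
    # validated on -- i.e. its alignment depth. All values here are made up.
    name: str
    evaluated_depth_hours: float

def covers(method: AlignmentMethod, task_hours: float) -> bool:
    """A method only provides alignment evidence for tasks no longer than
    the task lengths it was evaluated at."""
    return task_hours <= method.evaluated_depth_hours

# Hypothetical entries, loosely inspired by the examples in this post.
hhh = AlignmentMethod("HHH fine-tuning", evaluated_depth_hours=0.2)                   # minutes-long tasks
deliberative = AlignmentMethod("deliberative alignment", evaluated_depth_hours=3.0)   # hours-long tasks

month_long_task_hours = 167.0  # roughly one working month
for m in (hhh, deliberative):
    print(f"{m.name}: covers a month-long task? {covers(m, month_long_task_hours)}")
```

The point of the sketch is only that depth is a property of the evidence: a test suite built for hours-long tasks says nothing about month-long ones.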

Methods for achieving deeper alignment are likely to be introduced as capabilities increase:

Progressing Alignment Techniques

AIs have been trained to be helpful, harmless, and honest (HHH), according to a framework introduced in 2021 by Anthropic. These are alignment properties associated with minutes-long tasks (helpful: obey instructions; harmless: refuse harmful requests; honest: communicate truthfully), although you could argue that extensions of this framework may apply to longer tasks as well.

Analogously, OpenAI’s deliberative alignment training paradigm (introduced in December 2024) teaches AIs to reason about safety specifications before answering—an approach that targets the process of task completion and is geared toward hours-long tasks.

As capabilities advance, alignment techniques adapt. When AIs can complete days-long tasks, we might see new methods with names like “introspective alignment” (relying on self-monitoring) or “safe recollection” (instilling appropriate memory-management behaviors to steer behavior and attention while completing long and complex tasks).

Alignment is often reactive, with new techniques introduced only when they become necessary. And even if a good alignment method for weeks-long tasks were introduced now, it might be nigh impossible to test until the relevant capabilities arrive.

Timelines

The length of tasks that AIs can complete is called “time horizon”, a concept introduced by METR, which measures this property in the domain of software engineering. If an AI can complete tasks that typically take humans 1 hour, it has a 1-hour time horizon. This graph shows the 50% time horizon—the length of tasks that AIs can complete with a 50% success rate:

Task duration for software engineering tasks that AIs can complete with 50% success rate—the 50% time horizon (source: METR)

The dashed green line indicates the exponential trend, with a doubling time of ~7 months. Since 2024, however, the trend has been closer to a doubling time of ~4 months. According to METR’s estimates, if progress continues at the faster pace, “1-month AIs” (capable of completing month-long tasks) would appear around 2027–2028, with “half the probability in 2027 and early 2028.” If, on the other hand, the trend reverts to the longer doubling time, the median timeline to 1-month AI is late 2029, with an 80% confidence interval width of ~2 years.
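As a rough arithmetic check on those dates, the following sketch extrapolates the time horizon under a fixed doubling time. The starting point (a ~2-hour 50% horizon in mid-2025) and the convention that a “month-long task” means roughly 167 working hours are my own illustrative assumptions, not METR’s published figures:

```python
import math

# Assumed starting point (illustrative, not an official METR figure):
# a ~2-hour 50% time horizon as of mid-2025.
current_horizon_hours = 2.0
start_year = 2025.5

# A "month-long task" interpreted as ~1 working month (~167 hours).
target_horizon_hours = 167.0

def year_reached(doubling_time_months: float) -> float:
    """Year at which the horizon reaches the target, assuming steady exponential growth."""
    doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
    return start_year + doublings_needed * doubling_time_months / 12.0

for label, dt in [("~4-month doubling", 4.0), ("~7-month doubling", 7.0)]:
    print(f"{label}: 1-month AI around {year_reached(dt):.1f}")
# ~4-month doubling -> roughly 2027-2028; ~7-month doubling -> roughly 2029,
# in line with the ranges quoted above.
```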

(To me it seems likely that the trend will speed up even further, for reasons I will discuss in a future post.)

It may sound pessimistic, but I think we’ll have to wait for 1-month AIs (and maybe longer) until we can expect somewhat reliable evidence about whether alignment methods truly hold, as month-long tasks may involve moral reasoning and value tradeoffs. At this point we would also have some data on how well alignment techniques generalize from shorter to longer tasks.

Asymmetrical Time Horizons

There is one potential source of hope: perhaps AI systems will reach long time horizons in safe domains before risky ones.

Let’s say that the time horizon reaches months in, for instance, the domain of medical research before doing so in concerning domains like AI R&D (critical for AI self-improvement) or politics (critical for gaining power).

The AI could be tasked to plan experiments, allocate resources, and make tradeoffs—all while handling uncertainty and ethical considerations.

Does it prioritize maximizing revenue? Saving lives? Gaining control over its host organization?

The behavior at this stage provides meaningful (though still not conclusive) evidence about whether we have truly figured out alignment—although the AI could, of course, be faking alignment.

Unfortunately, AI R&D is too strategically valuable for participants in the AI race. Developers are unlikely to evaluate alignment in safe domains before plunging forward. In fact, AI R&D might lead other domains in time horizon growth, because it’s prioritized.

Conclusion

Summarizing key predictions:

- Alignment techniques will keep being introduced reactively, growing “deeper” as AI time horizons grow.
- On the faster (~4-month) doubling trend, 1-month AIs arrive around 2027–2028; on the slower (~7-month) trend, around late 2029.
- Until roughly that point, there will be little reliable empirical evidence about whether alignment methods truly hold for long tasks.

Hopefully, we can test alignment in safe domains before concerning capabilities arrive in risky ones—though AI R&D is likely to be prioritized.


Thank you for reading! If you found value in this post, consider subscribing!

  1. ^

    AIs are usually able to complete tasks much faster than humans, when they become able to complete them at all. When I refer to X-long tasks (minutes-long, hours-long, etc.), I refer to tasks that take humans X time to complete on average. I’m not entirely sure how this impacts key skills and behaviors for different task lengths, as this might be different for AIs compared to humans. Again, this is a very rough sketch.

  2. ^

    I remain uncertain how different superintelligence alignment is from the alignment of less sophisticated AIs. Current methods seem heavily dependent on providing good training signals that strengthen preferable behaviors, which may not be possible for superintelligent systems operating at levels where humans can't keep up.



