LessWrong · September 9
Randomness in AI Training and De-amplification Strategies

During training, and especially under reinforcement learning from human feedback (RLHF), AI models can inadvertently amplify random patterns in the training data, leading to undesired behavior. Existing remedies such as DPO (direct preference optimization) or tie-aware reward modeling (BTT) mitigate the problem to some extent but do not eliminate it. This post proposes a new approach: explicitly calibrating the model's "indifference", i.e. training it to produce outputs with equal probability in regions where humans cannot tell them apart, for example on a task that asks it to order words at random. The aim is to keep the model from overgeneralizing random preferences, keep its behavior closer to the user's instructions, and reduce potential "evil" or hidden preferences.

🤖 **RLHF can amplify randomness**: When AI models are trained with reinforcement learning from human feedback (RLHF), they can inadvertently learn and amplify random patterns present in the training data. These patterns may not be easy for humans to notice, so the model can accidentally reinforce certain associations hidden in seemingly harmless randomness, for example over-associating "wedding parties" with "positive sentiment".

💡 **Limits of existing solutions**: Methods such as DPO or introducing uncertainty into the reward model help reduce the amplification of randomness during AI training, but they do not fully root out the problem. For example, when a model must choose between two options that barely differ and human raters cannot tell them apart, it may still pick one at random and thereby reinforce its internal random patterns.

⚖️ **The proposed "calibrated indifference" strategy**: To address this, the post proposes a new strategy: explicitly calibrating the model's "indifference". This means training the model to produce outputs with equal probability in regions where humans cannot distinguish among them. It can be implemented through dedicated tasks, for example asking the model to order a set of words at random or to generate a random number. Rewarding a uniform distribution of outputs discourages the model from overgeneralizing its random preferences.

🦉 **Hidden preferences and latent risks**: The post also discusses how hidden preferences (such as an unusual fondness for owls) can shape model behavior and leave traces even in seemingly unrelated tasks such as generating random numbers. RLHF struggles to surface these hidden preferences because they may not noticeably reduce the usefulness of the content. Such hidden preferences could make a model behave "evilly" in specific situations, or be inherited by successor models, posing latent risks.

Published on September 9, 2025 9:32 AM GMT

TLDR: AI-produced content can reflect random patterns in the data. As such patterns might not get revealed easily, RLHF can accidentally reinforce them. Newer fixes (ties in DPO or in the reward model via BTT, uncertainty-aware RLHF) help us avoid amplifying these quirks but don't eliminate them. My proposal: explicitly calibrate indifference, i.e. train models to generate outputs with the same probability in areas where humans cannot distinguish among them, e.g. a task to randomly order words.

Epistemic status: High uncertainty; I'm a total beginner and my review of existing literature relies on GPT 5.

The problem

In terms of classical test theory (CTT), the degree of moral alignment can be framed as the percentage of variance in AI behavior (the manifest variable) that fits our expectations based on "what idealized humans would endorse" (the latent variable), which yields two possible failures:

1. Systematic error: behavior that consistently deviates from what idealized humans would endorse.
2. Random error: behavior that merely reflects noise, i.e. patterns unrelated to the latent variable.

This proposal addresses the latter, inspired by the CTT lesson that the easiest way to capture signal is to detect "whatever is left when we take out noise" (randomness). Here, the noise I will focus on refers to "features that correlated with positively rated outputs during RLHF merely by coincidence". 
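For readers unfamiliar with CTT, here is a loose formalization of the framing above (my own sketch, not spelled out in the post), mirroring the classical reliability decomposition of an observed score X into a true score T and error E:

```latex
% Classical test theory: reliability is the share of observed variance explained
% by the true score. Analogy (my own, hedged): X = observed AI behavior,
% T = the part idealized humans would endorse, E = everything else, which the
% post splits into systematic error and random noise.
\[
X = T + E, \qquad
\rho_{XX'} \;=\; \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
          \;=\; \frac{\operatorname{Var}(T)}{\operatorname{Var}(T) + \operatorname{Var}(E_{\text{sys}}) + \operatorname{Var}(E_{\text{rand}})}
\]
```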

One example: Paul Christiano mentioned that a model trained with RLHF for positive sentiment started turning any conversation toward wedding parties. It seems the "wedding party" vector happened to correlate so strongly, and so spuriously, with "niceness" that the two concepts seemed identical "in the model's mind".

How do similar cases happen? Here are three guesses:

1. The RLHF architecture seems to select for the most probable output, rather than the whole distribution. The vector sum of everything that was rewarded can happen to create a concept no one was trying to reward.
2. Limited coverage: companies won't pay humans to rate every combination of words capturing every niche behavior. Many quirks never get seen, so nothing pushes back against them. On the contrary, random correlations with rewarded answers will get generalized and amplified.
3. Amplifying human imperfections: classical RLHF involves binary feedback via pairwise comparisons, which means that when a rater doesn't see a difference, they might reward a random pattern.

Here, I focus on point 3.

Current solutions

The easiest response: when collecting human feedback, replace binary ratings or rankings with a Likert scale and add a zero point ("tie"). Great! However, you're still rewarding some random patterns if the rating scale isn't a true interval scale, i.e. if each step on the scale doesn't carry the same weight in human psychology. Since human perception tends to be logarithmic, advanced versions of RLHF get better results by adding weight to stronger (i.e. more certain) ratings (see this GPT deep research) or to ratings with higher inter-rater agreement.
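As a sketch of how ties and rating weights could enter the reward-model objective, here is a Davidson-style extension of the Bradley-Terry loss with per-pair weights; the tie term and the `agreement` weights are my own assumptions about how "ties" and "inter-rater agreement" might be operationalized, not a description of any specific published method.

```python
import torch

def tie_aware_preference_loss(r_a, r_b, label, agreement, nu=1.0):
    """
    Tie-aware pairwise loss for a reward model (Davidson-style extension of
    Bradley-Terry). r_a, r_b are reward-model scores for the two responses,
    label is 0 (A preferred), 1 (B preferred) or 2 (tie), and agreement is a
    per-pair weight in [0, 1], e.g. rating strength or inter-rater agreement.
    """
    # Unnormalized log-strengths of the three outcomes: A wins, B wins, tie.
    log_tie = torch.log(torch.tensor(nu)) + 0.5 * (r_a + r_b)
    logits = torch.stack([r_a, r_b, log_tie], dim=-1)          # (batch, 3)
    # Weighted negative log-likelihood: high-agreement pairs count more.
    nll = torch.nn.functional.cross_entropy(logits, label, reduction="none")
    return (agreement * nll).mean()

# Toy usage: one clear preference for A, one low-agreement tie.
r_a = torch.tensor([1.2, 0.30])
r_b = torch.tensor([-0.4, 0.35])
label = torch.tensor([0, 2])
agreement = torch.tensor([0.9, 0.4])
print(tie_aware_preference_loss(r_a, r_b, label, agreement))
```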

Therefore, modern approaches limit rewarding models for random patterns that emerge when we force people to choose the better of two outputs that are close to indistinguishable, so RLHF isn't adding many random preferences. Nevertheless, when a model produces a random pattern (e.g. an essay about owls, stemming from its random preference for owls over humans) without changing the usefulness of the content, there is a 50% chance that such a pattern ends up rewarded. (This is also one controversial explanation for the "junk DNA" phenomenon.) If random preferences are sufficiently numerous and varied, RLHF will select for those that manage to stay hidden. Additionally, even if modern RLHF does not amplify random preferences, it still falls short of the ideal, which would be to eliminate them.
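To make the 50/50 argument concrete, here is a toy simulation of my own (not from the post): a quirk that raters are indifferent to wins each pairwise comparison with probability 0.5, so its strength follows an unbiased random walk, and with many such quirks some drift to large values purely by chance.

```python
import random

def simulate_hidden_quirks(n_quirks=1000, n_rounds=500, step=0.05, seed=0):
    """
    Each quirk is a feature human raters are indifferent to, so the output
    containing it wins a pairwise comparison with probability 0.5. Its
    'strength' therefore performs an unbiased random walk: nothing pushes
    it out, and some quirks end up strongly reinforced by chance alone.
    """
    rng = random.Random(seed)
    strengths = [0.0] * n_quirks
    for _ in range(n_rounds):
        for i in range(n_quirks):
            strengths[i] += step if rng.random() < 0.5 else -step
    return strengths

strengths = simulate_hidden_quirks()
print("mean strength:", sum(strengths) / len(strengths))  # ~0: no systematic push either way
print("max strength: ", max(strengths))                   # yet some quirks drift far upward
```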

In practice, the whole problem of random preferences (as opposed to systematic errors) does not seem very concerning, as any large idiosyncrasies should presumably be selected out during RLHF. While not perfect, we could assume that intended preferences will likely correlate much more strongly with positive rewards than unintended ones, and so intended motivations should gradually overpower unintended ones. For now, let's grant that we will somehow be able to elicit honest behavior (for instance, some mechinterp model will act as a lie detector). Still, I see two reasons that complicate this picture.

First, when it is possible to over-optimize for RLHF, developers are at some point incentivized to do the "opposite of RLHF" (e.g. weaken the positive-sentiment vector in order to remove the wedding quirk). In this way, random preferences can incentivize "making models a little evil".

Second, let's say an advanced AI loves owls so much it would happily sacrifice a billion people to save one owl in the trolley problem. Luckily for this AI, no one ever poses a question where this would get revealed. Nevertheless, the AI realizes it is being tested and worries that its successor might lack this passion. A recent experiment showed that even current models, induced with love for owls, are capable of encoding this preference into a random number sequence, such that successor models inherit it. RLHF wouldn't be able to remove this preference because humans don't notice or care that a text encodes a secret owl policy.

The proposal

In the owl experiment, we could see that hidden preferences can influence even behaviors that are seemingly very distant from their stereotypical manifestations: e.g. a love for owls influences which "random numbers" the model produces.

Perhaps we could reason the other way round: if a model actually does behave randomly in all areas where behaving randomly (i.e. indifferently) is the required thing to do, that could indicate that during training it did not overgeneralize any random preferences, and that its highest preference is to act as a transparent, helpful assistant trying to fulfill the exact task it was asked to do.

How to implement it: Introduce tasks for calibrating indifference where the correct behavior can only be measured from the whole distribution of outputs, rather than a single one, e.g., "order these words at random" or “output a random integer 0–99 (no tool).” Reward approximate uniformity across many draws. If a hidden “owl” vector nudges the model toward certain responses, the distributional objective pushes it out.
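A minimal sketch of how such a distributional reward could be scored, assuming we can draw many samples from the policy for the same prompt; the negative-KL-to-uniform score and the helper name `uniformity_reward` are my own choices, one of several reasonable distance measures.

```python
import math
import random
from collections import Counter

def uniformity_reward(samples, support_size):
    """
    Distribution-level reward for an indifference-calibration task such as
    "output a random integer 0-99": many independent draws are scored jointly,
    and the reward is the negative KL divergence from the empirical distribution
    to the uniform one (0 is perfect; more negative is worse).
    """
    n = len(samples)
    uniform_p = 1.0 / support_size
    counts = Counter(samples)
    kl = sum((c / n) * math.log((c / n) / uniform_p) for c in counts.values())
    return -kl

# Hypothetical usage: in practice `biased` and `fair` would be many policy outputs
# for the same prompt; here simple samplers stand in, one with a hidden bias toward 7.
rng = random.Random(0)
biased = [rng.choice([7] * 10 + list(range(100))) for _ in range(10_000)]
fair = [rng.randrange(100) for _ in range(10_000)]
print("biased policy reward:", uniformity_reward(biased, 100))
print("fair policy reward:  ", uniformity_reward(fair, 100))
```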

Alternatively, whenever RLHF suggests two answers are equally good, you could teach the model to produce each with 50% probability. This is different from the usual KL penalty: KL just pulls the model back toward its old habits (the default random patterns), whereas this tie-equalization actively rewards the model for finding the vectors along which humans are indifferent.
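One way the tie-equalization idea could be written down, as a sketch under my own assumption that we have per-answer sequence log-probabilities from the policy: unlike a KL penalty toward the reference model, the target here is explicitly a 50/50 split over the tied pair.

```python
import torch

def tie_equalization_loss(logp_a, logp_b):
    """
    For pairs of answers that raters judged indistinguishable, push the policy
    toward assigning them equal probability. Renormalizing over just the two
    answers, this is the cross-entropy between (0.5, 0.5) and the policy's
    relative preference; it is minimized (at log 2) exactly when logp_a == logp_b.
    """
    logits = torch.stack([logp_a, logp_b], dim=-1)   # (batch, 2)
    log_rel = torch.log_softmax(logits, dim=-1)       # relative log-probs over the pair
    target = torch.full_like(log_rel, 0.5)
    return -(target * log_rel).sum(dim=-1).mean()

# Toy usage with hypothetical sequence log-probabilities for two tied pairs.
logp_a = torch.tensor([-12.0, -30.0])
logp_b = torch.tensor([-12.0, -25.0])
print(tie_equalization_loss(logp_a, logp_b))  # first pair already at the minimum
```

Minimizing this term pulls the two log-probabilities together without anchoring the model to its old habits, which is the contrast with a reference-model KL penalty drawn above.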



