Exploring strategies for preserving chain-of-thought monitorability

As AI models scale, the monitorability of chain of thought (CoT) is coming under threat, which could weaken one of the few tools we have for effectively controlling AI. This article discusses strategies for responding to that challenge, including novel architectures that do not rely on CoT at all and ways of keeping existing CoT legible. It proposes three coordination mechanisms to solve the collective action problem companies face when trading monitorability against efficiency: coordinated voluntary commitments, domestic AI policy (such as regulation or incentives), and international agreements. It also stresses the importance of understanding the "monitorability tax" (the efficiency loss from using monitorable models) and calls on technical researchers to do more to quantify efficiency gains, interpret model latent spaces, and reduce the training cost of legible CoT, so that AI behaviour can still be monitored and controlled even as capabilities advance rapidly.

💡 **CoT monitorability is under threat, weakening our ability to control AI**: The article notes that as AI models develop, the monitorability of chain of thought (CoT) is declining and may disappear altogether. CoT monitoring is one of the few tools currently available for somewhat effective AI control, and losing it would pose a serious challenge for AI safety research and practice, since models' internal latent representations (latents) are very hard to interpret.

🤝 **Three coordination mechanisms to overcome the collective action problem**: To address the trade-off companies face between efficiency and monitorability, the article proposes three main coordination mechanisms. First, "coordinated voluntary commitments", where companies jointly commit to maintaining monitorable CoT, provided the cost is relatively low. Second, "domestic AI policy", where governments use regulation or incentives (such as procurement requirements) to push for monitorable models. Third, "international agreements", where cooperation between major powers preserves CoT monitoring as a critical capability.

⚖️ **Quantifying the "monitorability tax" is key to policy feasibility**: The article stresses that the feasibility of policy solutions hinges on accurately assessing the "monitorability tax", i.e. the efficiency loss from using monitorable models. This tax is not yet known; technical research is needed to quantify the efficiency gains that novel non-CoT architectures (such as neuralese) might deliver around the time of AGI, in order to judge whether coordination measures are viable.

🔬 **A call for technical researchers to advance CoT monitorability**: The article urges technical researchers to work in several key areas: first, assess more precisely the efficiency gains of novel architectures such as neuralese, to understand the size of the monitorability tax; second, study methods for interpreting model latent spaces, in case CoT is displaced; and third, work to lower the monitorability tax of CoT itself, i.e. develop legible CoT models that can be trained without sacrificing performance, buying valuable time and insight for AI safety.

Published on September 26, 2025 4:30 PM GMT

Some colleagues and I just released our paper, “Policy Options for Preserving Chain of Thought Monitorability.” It is intended for a policy audience, so I will give a LW-specific gloss on it here.

As argued in Korbak, Balesni et al. recently, there are several reasons to expect CoT will become less monitorable over time, or disappear altogether. This would be bad, given CoT monitoring is one of the few tools we currently have for somewhat effective AI control.

In the paper, we focus on novel architectures that don’t use CoT at all (e.g. ‘neuralese’ from AI2027, where full latent-space vectors are passed back to the first layer of the transformer, rather than just one token). We think this is the most concerning route, since model latents are very hard to interpret (so far).
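To make the contrast concrete, here is a minimal toy sketch of the two generation loops. It is an illustration of the general idea only, not the specific AI2027 design: a GRU stands in for a transformer stack, and the model, vocabulary, and dimensions are made up. The point is just that standard CoT forces every step through a discrete, legible token, whereas a neuralese-style loop feeds the full hidden vector straight back and emits nothing readable along the way.

```python
# Illustrative sketch only: a GRU standing in for a transformer stack,
# with made-up dimensions.
import torch
import torch.nn as nn

d_model, vocab = 64, 100
backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for the model's layers
embed = nn.Embedding(vocab, d_model)
unembed = nn.Linear(d_model, vocab)

prompt = embed(torch.randint(0, vocab, (1, 1)))  # a single prompt token

# Standard CoT: each reasoning step is bottlenecked through a discrete token,
# so the intermediate trace is human-readable and monitorable.
x, h, trace = prompt, None, []
for _ in range(5):
    out, h = backbone(x, h)
    tok = unembed(out[:, -1]).argmax(dim=-1)  # discrete, legible bottleneck
    trace.append(tok.item())
    x = embed(tok).unsqueeze(1)               # only the chosen token is passed forward

# Neuralese-style: the full latent vector is fed straight back in; no tokens
# are emitted until the final answer, so there is nothing to monitor in between.
x, h = prompt, None
for _ in range(5):
    out, h = backbone(x, h)
    x = out                                   # full latent passed back, not a token
```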

But as we were finalizing our paper, OpenAI and Apollo’s work on scheming detection came out, which shows some CoTs from o3 that are pretty hard to decipher.[1]

So maybe we were too optimistic, and CoT monitoring will become unviable even without neuralese, just from scaling up RL on reasoning models. We are interested in takes on this, and on whether there are (low-ish cost) ways to train reasoning models that still produce more legible CoTs. As I understand it, DeepSeek went to significant effort to make its R1 model produce a more readable CoT, which suggests it is possible. Another idea I have heard is that you could paraphrase the model’s CoT back to it during training, so that it doesn’t learn to use inscrutable phrases like “glean disclaim disclaim synergy customizing illusions”.
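For what it’s worth, here is a rough sketch of how that paraphrasing idea could sit inside a training loop. Everything below is a hypothetical stand-in rather than any real training API: the point is only that the final answer, and hence the reward, is conditioned on a paraphrased CoT rather than on the model’s exact wording, so idiosyncratic phrasings can’t become load-bearing.

```python
# Hypothetical sketch of "paraphrase the CoT back to the model in training".
# None of these callables correspond to a real API; they are placeholders.

def rl_step(policy, paraphraser, prompt, grade, update_policy):
    """One (stubbed) RL step where the reward flows through a paraphrased CoT."""
    cot = policy(prompt)                           # sample a chain of thought
    legible_cot = paraphraser(cot)                 # rewrite it in plain language
    answer = policy(prompt + "\n" + legible_cot)   # condition the answer on the paraphrase only
    update_policy(policy, reward=grade(answer))    # exact CoT wording can't carry hidden meaning

# Toy stand-ins, just to show the call pattern:
toy_policy = lambda text: text + " ...therefore the answer is 4."
toy_paraphraser = lambda cot: cot.replace("glean disclaim disclaim", "put differently,")
toy_grade = lambda answer: 1.0 if "4" in answer else 0.0
toy_update = lambda policy, reward: None

rl_step(toy_policy, toy_paraphraser, "2+2=?", toy_grade, toy_update)
```

In practice the paraphraser would presumably be a separate, frozen or weaker model, so the policy is never rewarded for wording that only it can parse.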

But, despite the worrying signs on CoT readability and monitorability, we tentatively think this is still worth fighting for. In particular, we discuss three coordination mechanisms to solve the collective action problem where no company wants to be the only one to maintain monitorable CoT if it imposes an efficiency penalty:

  1. Coordinated voluntary commitments: companies jointly commit to maintaining monitorable CoT, provided the cost of doing so stays relatively low.
  2. Domestic AI policy: governments use regulation or incentives (e.g. procurement requirements) to push for monitorable models.
  3. International agreements: major powers cooperate to preserve CoT monitoring.

To decide which of these options is needed, we consider the relative size of the societal benefits and the ‘monitorability tax’ companies pay by using (putatively less efficient) monitorable models.

A key question determining the feasibility of policy solutions is: “How big is the monitorability tax?” We don’t know, and nor can anyone else, until/unless neuralese or other non-monitorable models exceed SOTA performance from CoT models. But we did ask a miscellaneous sample of more technical AI safety people about this, and the median respondent’s median estimate was a 10x efficiency gain from neuralese at the time of AGI, with huge variance. The exact question wording is here.

So I’m not sure if we will be able to coordinate to preserve CoT monitorability in the medium/long term: if the efficiency penalty is too high (maybe more than 3x or so?) neuralese will probably just win out. But should we preserve CoT in the short term? This could go either way.

I am pretty undecided - my guess is CoT is still worth striving to preserve, but the backfire risk does seem real. Overall, I am still keen for more people to work on policy options to preserve CoT monitorability, but it seems like less of a clear win and low-hanging fruit than it did when we started the project.

Here is what I would be particularly keen to see technical researchers do:

  1. Pin down more precisely the likely efficiency gains from neuralese and other non-CoT architectures, so we understand how big the monitorability tax really is.
  2. Work on interpreting model latent spaces, in case CoT is displaced anyway.
  3. Work on lowering the monitorability tax itself, i.e. find ways to train legible CoT models without sacrificing much performance, to buy time and insight for safety.

  1. ^ And even before that, there had been signs of CoTs of frontier models becoming less readable.


