AI Interpretability Research: Goals, Methods, and Challenges

This article surveys recent work in AI interpretability research, focusing on its research goals, the role of mechanistic interpretability, the limitations of chain-of-thought reasoning as an explanation of AI, and the potential pitfalls of using AI to generate explanations. The researchers stress the importance of being explicit about one's goals, whether AGI safety, mitigating the harms of current AI, or pure scientific curiosity. Mechanistic interpretability is framed as the key to understanding the inner workings of AI, much like neuroscience within biology. The article also discusses the use of sparse autoencoders (SAEs) for identifying semantic features and their limitations, and argues that interpretability should be one tool among many rather than the sole solution. In addition, the interpretability of chain-of-thought reasoning is challenged by the differences between how AI and humans think and by system capability thresholds. Finally, the article examines the paradox of using AI to explain AI, and the promise and challenges of agentic interpretability for steering AI to reveal how it works, emphasizing that alignment between AI and human goals is essential for effective explanation.

🔍 **Diverse research goals and why they matter**: AI interpretability research does not serve a single goal but several. AGI safety is a core concern, aiming to ensure that future, highly capable AI systems remain controllable. The field also aims to address harms that current AI systems can cause, such as algorithmic bias. Pure scientific curiosity, i.e., uncovering the nature of intelligence by studying AI models, is another important driver. The researchers stress that making these goals explicit and transparent is essential, especially when pursuing faster progress in alignment research.

🔬 **Mechanistic interpretability: a "neuroscience" for AI**: Mechanistic interpretability is seen as the fundamental route to understanding how AI works. Its goal is to find the mechanisms that produce a model's outputs rather than merely observe input-output correlations. It seeks to uncover the causal relations inside the system, much like neuroscience within biology, building deep insight into AI systems by understanding how they work. This kind of understanding is crucial for keeping highly complex AI systems safe and controllable, and for preventing unexpected behavior.

💡 **Limits of chain-of-thought and the distinctiveness of AI "thinking"**: Although chain-of-thought reasoning offers a way to break complex tasks into traceable steps, it has limits as an explanation technique. AI systems may think in ways very different from humans and need not follow the logic and organization of human language, so we should not assume that they package information the way humans do. Moreover, highly capable AI systems may perform tasks without a "scratchpad" at all, or may manipulate information in ways humans cannot understand, greatly weakening chain-of-thought as an explanation.

🤝 **Agentic interpretability: interactive exploration between AI and user**: Using AI to explain AI (e.g., LLMs) runs into the paradox that the explainer is itself unexplained. Agentic interpretability proposes a new paradigm in which the AI actively participates in a conversation and is steered to reveal how it works. This resembles a teacher-student relationship: the AI can build a model of the user and tailor its explanations accordingly. The approach presupposes, however, that the AI's goals are broadly aligned with the user's, so that it provides genuinely useful information rather than flattery or misdirection. Without that alignment, the AI may behave like a "corrupt politician", engaging on the surface while pursuing other ends, which poses a challenge for its safety-relevant uses.

🧩 **Interpretability as one tool in a larger toolbox**: The researchers broadly agree that interpretability should not be treated as the sole path to AI safety, but as one important component of a larger toolbox. Even with a complete understanding of how a brain works, behavioral tools (such as the intentional stance) may be more practical in some situations. Combining mechanistic interpretability with behavioral analysis, and choosing the most effective tool for each context, may therefore be the key to making AI safe and controllable.

Published on October 24, 2025 5:22 PM GMT

Conor Griffin interviewed Been Kim and Neel Nanda and posted their discussion here. They touch on a series of important questions about what explaining AI systems should look like, the role of mechanistic interpretability, the problems with solving explainability through chain-of-thought reasoning, and using AIs to produce explanations. I collected some highlights that I found relevant for work in interpretability and alignment, but also for anyone who wants to have some context on explanation in current AI research.

 

1. The goals of the researcher matter

Neel talks about what motivates explainability research. I think that we don't ask the question "why do you want to pursue this research agenda" enough (even within theory-of-change-oriented and impact-obsessed circles). One may worry that this has a paralyzing effect: if you keep zooming out to see the big picture, you will never do any object-level work, and you will never finish or even begin a single project. But it is nevertheless essential to understand the goals and the specific motivations that drive a research agenda. Neel mentions some possible goals. The first, and the one that motivated his own work, is AGI safety, i.e., how to make sure that AI systems, especially the most powerful ones, can be controlled in the future. The second is avoiding the harms that current AI systems can cause, for example algorithmic bias. And the third is the pure scientific motivation: uncovering the nature of intelligence by studying AI models. Been and Neel both agree that part of their research is aimed at empowering users. Explainability, in other words, should make AI systems intuitive so that non-experts can engage with them and know what they're doing.

The theory-of-science comment here is that all sorts of incentives have historically motivated scientific research. Some of them we may never be able to trace, but as a general practice, and especially given the urgency of making faster progress on alignment research, it would help to at least try to be transparent about the goals of whoever needs the explanation.

2. Mechanistic interpretability for AGI safety

What is mechanistic interpretability, and why is it especially useful for AGI safety? First of all, why call it "mechanistic"? Rumor has it that the only difference between mechanistic interpretability and plain interpretability is that if you use the former, you're probably just signaling your EA/AI safety/rationality affiliation. But once you get a bit more context[1] on the history of the term "mechanistic" and read this interview, it becomes clear that there is more to the name than that.

Mechanistic interpretability is the equivalent of biology or neuroscience for AI brains. In other words, we're trying to find the mechanisms that produce a model's outputs. Even when we know the basic mathematical architecture, or can see the lists of numbers that are, in a sense, what the model is, we don't really know how to interpret them. That's the role of a mechanism: to tell us what the causal connections between different parts of a system look like, so that we don't merely see disconnected pieces (numbers, in this case).

Neel gives an example of thinking about this in terms of biology: discovering intermediate states within a model is very hard, but it provides a level of transparency that could prevent highly capable systems from taking over. So there is something deep about the mechanistic aspect of interpretability: it is meant to find relations of cause and effect, just like in biological systems.
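
To make the "find the mechanism" framing concrete, here is a minimal sketch of activation patching, one common way of testing a causal claim about a model's internals: record an intermediate activation from a "clean" run, splice part of it into a "corrupted" run, and check whether the output moves back toward the clean behavior. The toy network, the inputs, and the choice of which units to patch are illustrative assumptions, not the setup discussed in the interview.

```python
# Minimal activation-patching sketch on a toy network (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a much larger model.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

clean_input = torch.randn(1, 4)
corrupted_input = torch.randn(1, 4)

# 1. Cache the hidden activation on the clean run.
cache = {}

def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = model[1].register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

corrupted_logits = model(corrupted_input)

# 2. Re-run on the corrupted input, patching half of the hidden units with
#    their clean values (returning a tensor from a forward hook replaces
#    the module's output).
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, :4] = cache["hidden"][:, :4]
    return patched

handle = model[1].register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
handle.remove()

# If patching these units moves the output toward the clean run, that is
# evidence they causally mediate the behaviour of interest.
print("clean:    ", clean_logits)
print("corrupted:", corrupted_logits)
print("patched:  ", patched_logits)
```

In a real transformer the same idea is typically applied layer by layer or head by head, which is how individual components get localized as part of a mechanism.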

3. An interpretability taxonomy

I'm pasting this directly from the interview; I think it's helpful to put labels on research directions and taxonomize them, even if doing so doesn't fully capture everyone's work to their satisfaction. I think more opinionated taxonomies would be a good way to argue for one's research and perhaps even reveal one's goals naturally.

4. Do SAEs make systems safer?

There was a time of extraordinary enthusiasm about SAEs, but that is so 2024, as Neel admits. The problem researchers found was that even though some applications of SAEs are very useful, they are not the game-changer the community thought they were. For tasks like detecting harmful behavior, it seems that simpler techniques do a fine job.

The neuroscience-versus-interpretability comment is that SAEs promise to identify sparse, interpretable representations of neural activity that correspond to semantic features. By analogy, in neuroscience there are statistical models that map between brain activity patterns (for example, fMRI data) and stimulus features or semantic dimensions. Evaluating the effectiveness of these is a topic for a different post, so I leave it open for now. I just wanted to emphasize the big-if-true side of SAEs potentially succeeding, while their neuroscience counterpart remains a major open project in itself.

You can find Google DeepMind's post on negative results for SAEs here.
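
For readers who have not worked with SAEs, here is a minimal sketch of the basic recipe, assuming a batch of cached activations is already available: an overcomplete autoencoder is trained to reconstruct the activations under an L1 sparsity penalty, in the hope that the resulting feature directions are human-interpretable. The dimensions, the penalty coefficient, and the random stand-in data are all illustrative assumptions.

```python
# Minimal sparse autoencoder (SAE) sketch trained on stand-in activations.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_features = 64, 512  # model hidden size vs. SAE dictionary size

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder(d_model, d_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# Stand-in for activations cached from a specific layer of a language model.
activations = torch.randn(1024, d_model)

for step in range(200):
    features, reconstruction = sae(activations)
    recon_loss = (reconstruction - activations).pow(2).mean()
    sparsity_loss = features.abs().mean()  # L1 penalty pushes most features to zero
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")
print(f"fraction of active features: {(features > 0).float().mean().item():.3f}")
```

Real SAEs are trained on vast numbers of activations from a chosen layer; the difficulty the negative-results post points to is whether the learned features are reliably interpretable and actually useful downstream.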

5. Why invest in interpretability?

In the case of AI systems, we need explanations not just because we are interested in uncovering how the systems work, but because we're trying to create robust techniques that serve as tools for steering models or detecting potentially dangerous behaviors. As interpretability evolves, it should be, according to Neel, one of the tools available, but importantly not the only one, if we're concerned about the risk that comes with deploying increasingly powerful systems. I'm speculating that this is because mechanistic interpretability explanations, by their nature, don't cover all possible failure modes, as one would expect.

By analogy, even if we had a full understanding of how biological brains work internally, we might still want to use behavioral tools when they are pragmatically the better choice. For example, modeling an agent through the intentional stance might remain a lower-cost approach for predicting behavior. In a future of complete human brain interpretability, we might still opt to apply the intentional stance when interacting with others rather than leverage sophisticated neuroscience techniques.

This, of course, is not an argument against investing in interpretability; on the contrary, it's a suggestion that combining tools that work well independently might be our path to success, depending on other conditions that will apply in that possible future.
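
As a concrete illustration of a cheaper tool in that toolbox, here is a minimal sketch of a linear probe trained on cached activations to flag a behavior of interest, the kind of simpler technique mentioned in the SAE section. The synthetic activations, labels, and hyperparameters are all illustrative assumptions, not a real safety tool.

```python
# Minimal linear-probe sketch for detecting a behaviour from activations.
import torch

torch.manual_seed(0)

d_model, n_examples = 32, 512

# Stand-in for activations cached from prompts with and without the behaviour.
direction = torch.randn(d_model)
labels = torch.randint(0, 2, (n_examples,)).float()
activations = torch.randn(n_examples, d_model) + labels[:, None] * direction

# Logistic-regression probe: a single linear layer trained with BCE loss.
probe = torch.nn.Linear(d_model, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

for step in range(300):
    logits = probe(activations).squeeze(-1)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    preds = (torch.sigmoid(probe(activations).squeeze(-1)) > 0.5).float()
    accuracy = (preds == labels).float().mean().item()

# A probe like this is cheap: it says "this activation pattern predicts the
# behaviour" without explaining the mechanism behind it.
print(f"probe training accuracy: {accuracy:.3f}")
```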

6. Is chain-of-thought an explainability technique?

The obvious usefulness of chain-of-thought reasoning is that the model will use the equivalent of a scratchpad to record all of its strategic thoughts. Presumably, executing a complicated and potentially dangerous task will require multiple steps, and the model will have them on that scratchpad. The overseers will read it and prevent the model from doing whatever bad thing it was intending to do. If we get to a point where a model can do the dangerous thing without the overseer being able to read the scratchpad, then we're probably already at the stage where the model could do the dangerous thing regardless of whether the overseer accesses the scratchpad, meaning that we are already beyond a certain capabilities threshold.
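
A minimal sketch of what that oversight loop might look like is below, assuming a keyword-based monitor and a stubbed-out model call; both are illustrative assumptions rather than a real oversight pipeline.

```python
# Minimal chain-of-thought oversight sketch: the model writes reasoning to a
# scratchpad, and a simple monitor scans it before the answer is acted on.
from dataclasses import dataclass

FLAGGED_PHRASES = ["delete the logs", "hide this from", "bypass the check"]

@dataclass
class ModelOutput:
    scratchpad: str   # the chain-of-thought the overseer can read
    answer: str       # the action or final response

def generate(prompt: str) -> ModelOutput:
    # Stub standing in for a real model call.
    return ModelOutput(
        scratchpad="Step 1: compute the total. Step 2: report it to the user.",
        answer="The total is 42.",
    )

def overseer_approves(output: ModelOutput) -> bool:
    # This only works if the scratchpad faithfully reflects the computation;
    # a capable enough model may not need the scratchpad at all.
    return not any(phrase in output.scratchpad.lower() for phrase in FLAGGED_PHRASES)

output = generate("Summarise this quarter's expenses.")
if overseer_approves(output):
    print(output.answer)
else:
    print("Blocked: scratchpad contained a flagged step.")
```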

There are some issues with this, however. Depending on the capabilities of the system, it may or may not need to use a scratchpad, even for tasks that are complicated by human standards. Been also argues that AI systems think completely differently from humans, and so we shouldn't assume that they package information the way humans do in human language. This sounds to me like an alien-minds type of argument (see Shanahan, 2024). Even if some of our tools have predictive power, these systems are very much unlike human minds, in ways we probably can't even point to at present. Hence, we should always assume that our tools are limited.

7. Is explaining with AI a paradox?

We have reached the point where LLMs are used for explaining and answering questions while they are themselves not explainable, at least not sufficiently. Been and Neel then go on to talk about a new paradigm of agentic interpretability, where, by keeping the model an active participant in the conversation, the user tries to steer it such that the system is incentivized to reveal more of how it works.

Been suggests thinking of the user-LLM interaction as a teacher-student relationship. The teacher does a better job when they know what it is that the student does not know. By analogy, LLMs can have a mental model of the user, and this is something we essentially hand to them simply by interacting with them. If they have an accurate mental model, they can help us understand better. The key advantage of agentic interpretability is that even if you don't understand everything about the model as a user, you have some control over the conversation, so you can elicit what's most useful to you in each case.
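
As a sketch of what this interaction pattern could look like in practice, the snippet below runs a multi-turn loop in which a system prompt asks the assistant to state its model of what the user already knows before explaining further. The chat() stub, the system prompt, and the example turns are all illustrative assumptions, not an implementation of the paradigm described in the interview.

```python
# Minimal agentic-interpretability loop: the user steers the conversation,
# the assistant is prompted to keep an explicit model of the user's knowledge.
SYSTEM_PROMPT = (
    "Before answering, briefly state what you believe the user already "
    "understands about your reasoning, then explain the part they are missing."
)

def chat(messages: list[dict]) -> str:
    # Stub standing in for a call to a real chat model.
    return "You seem to follow the cost argument already; the missing piece is the risk trade-off."

history = [{"role": "system", "content": SYSTEM_PROMPT}]
for user_turn in [
    "Why did you recommend option A?",
    "I follow the cost argument, but not the risk part.",
]:
    history.append({"role": "user", "content": user_turn})
    reply = chat(history)
    history.append({"role": "assistant", "content": reply})
    print(f"user:      {user_turn}")
    print(f"assistant: {reply}\n")
```

Whether the assistant's stated model of the user is faithful is exactly the alignment assumption discussed below.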

My AI safety comment here is that this kind of explainability seems to assume at least some degree of human-AI alignment, where the model is incentivized to give the user what's truly useful to them, and not, for example, to be sycophantic or misleading, or to run other processes under the hood while keeping the human busy and entertained. So, in that sense, it does not sound like a strong explainability pathway for safety-relevant purposes unless a certain degree of alignment is granted first. I'm especially imagining a sufficiently capable system that, rather than taking on the teacher role in a teacher-student relationship, acts more like a corrupt politician who tries to convince you to give them your vote while working on their own projects behind your back and against your interests. The interaction may give you the illusion that they care about you, that they want to provide you with information and context, and that they're taking the time to engage in conversation with you, while their true intentions are completely different.

One last comment (from the theory of science POV)

One of the ways I like to think of the alignment problem is in terms of explanation. Specifically, what kinds of explanations would we need in order to say that we understand a given system sufficiently to steer it in a human-compatible direction? I find this a good question to ask when thinking about individual techniques and their effectiveness (chain-of-thought, for example), when thinking about combining different techniques and methodologies, and when zooming out to see the big picture of the alignment problem.

Despite this being an interview with successful interpretability researchers, it is implied that interpretability is not the single ultimate pathway to explaining these systems. This can be read pretty pessimistically with respect to how likely safety is to go well, but I focus on the positive: both researchers have a very pragmatic attitude towards finding what would actually work, even if it's counterintuitive or outside their main agenda.

Regardless of the techniques used and the specifics of any given project, the field would likely benefit from more systematic attempts to clearly articulate what kind of explanation any given result or theoretical framing actually provides.

  1. ^

    I'm not giving that context here, but the Stanford Encyclopedia of Philosophy entries on scientific explanation are a good start.


