Training Weight-Sparse Models to Find Interpretable Circuits in Transformers

The research proposes a novel method for discovering interpretable circuits by training Transformer models to have sparse weights. The resulting models contain high-quality circuits that are global, fine-grained, and often simple enough to draw in their entirety on a whiteboard. Although sparse language models trained this way are expensive, the main promise of the method lies in eventually training a fully interpretable moderate-sized model, which could aid in developing a theory of cognition. Preliminary results also suggest the method can be used to explain existing dense models.

💡 **Sparse weights and interpretable circuits:** The post proposes training Transformers while constraining most of their weights to be zero, with the aim of producing highly interpretable "circuits." This sparsity means each neuron connects to only a few other nodes, which simplifies the model's internal logic and makes it easier for humans to understand (see the sketch after these highlights).

🌟 **Circuit quality and granularity:** The circuits produced by this method have notable advantages: they are global (not tied to specific datapoints), fine-grained (resolved down to individual neurons and attention channels rather than entire MLP layers or attention heads), and concise (often simple enough to draw in their entirety on a whiteboard). This allows a deeper and more intuitive understanding of the model's internal mechanisms.

💰 **Cost and outlook:** Although training sparse language models from scratch with this method is very expensive, making it unlikely to be used to pretrain frontier models directly, its main value lies in eventually scaling the approach to train a fully interpretable moderate-sized model. The research also presents preliminary results on applying the method to explain existing dense models.

📊 **Capability-interpretability trade-off:** The research finds that making weights sparser sacrifices some model capability, while scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge.
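
The weight-sparsity idea in the first highlight can be illustrated with a minimal sketch. The snippet below is an assumption, not the authors' training procedure: it reapplies a simple top-k magnitude mask after each optimizer step (the keep fraction, toy MLP, and placeholder objective are all illustrative), showing how constraining most weights to zero leaves each neuron with only a few connections.

```python
# Minimal sketch (assumed, not the paper's exact method): keep only the
# largest-magnitude weights of each linear layer after every optimizer step,
# so each neuron ends up with only a few nonzero connections.
import torch
import torch.nn as nn

def apply_topk_mask_(linear: nn.Linear, keep_fraction: float = 0.01) -> None:
    """Zero out all but the top `keep_fraction` of weights by magnitude, in place."""
    with torch.no_grad():
        w = linear.weight
        k = max(1, int(keep_fraction * w.numel()))
        threshold = torch.topk(w.abs().flatten(), k).values.min()
        w.mul_((w.abs() >= threshold).to(w.dtype))

# Toy usage: one MLP block trained for a single step, then sparsified to ~1% nonzeros.
mlp = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

x = torch.randn(32, 256)
loss = mlp(x).pow(2).mean()            # placeholder objective, not a real LM loss
loss.backward()
opt.step()
for layer in mlp:
    if isinstance(layer, nn.Linear):
        apply_topk_mask_(layer)

nonzeros = sum((l.weight != 0).sum().item() for l in mlp if isinstance(l, nn.Linear))
print("nonzero weights remaining:", nonzeros)
```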

Published on November 13, 2025 6:30 PM GMT

TL;DR: We develop a novel method for finding interpretable circuits in Transformers, by training them to have sparse weights. This results in models that contain very high quality circuits: our circuits are global rather than datapoint dependent; we explain the circuit down to very granular objects, like individual neurons and attention channels, rather than entire MLP layers, attention heads, or groups of nodes; and the circuits are often simple enough to draw in their entirety on a whiteboard. The downside is that our method produces de novo sparse language models, which are extremely expensive to train and deploy, making it unlikely that we will ever be able to use this method to directly pretrain frontier models. We share preliminary results on using sparse models to explain an existing dense model, but our main theory of impact is to eventually scale our method to train a fully interpretable moderate-sized model. If we could fully interpret even (say) a GPT-3 level intelligence, it could aid dramatically in developing a theory of cognition in general.

[Blog] [Paper] [Code]

Abstract

Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting that our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.
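
To make the pruning step in the abstract concrete, here is a hedged sketch of one way to isolate the part of a model responsible for a task: greedily ablate hidden units and keep only those whose removal degrades task loss beyond a tolerance. The toy model, the `task_loss` helper, and the 10% tolerance are illustrative assumptions, not the paper's actual procedure.

```python
# Hedged sketch of task-circuit pruning (illustrative, not the authors' code):
# greedily ablate hidden units of a toy model and keep a unit only if removing
# it pushes the task loss past a tolerance.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(128, 16), torch.randn(128, 1)   # stand-in "task" data

def task_loss(mask: torch.Tensor) -> float:
    """Task loss with hidden units outside `mask` ablated (zeroed)."""
    with torch.no_grad():
        h = torch.relu(model[0](x)) * mask          # broadcast mask over the batch
        return nn.functional.mse_loss(model[2](h), y).item()

baseline = task_loss(torch.ones(64))
tolerance = 1.1 * baseline                          # allow 10% degradation (assumed)
mask = torch.ones(64)
for i in range(64):                                 # one greedy pass over hidden units
    trial = mask.clone()
    trial[i] = 0.0
    if task_loss(trial) <= tolerance:
        mask = trial                                # unit i is not needed for the task

circuit = mask.nonzero().flatten().tolist()
print(f"{len(circuit)} of 64 hidden units retained in the circuit")
```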



