Training Weight-Sparse Models to Find Interpretable Circuits in Transformers

The research proposes a novel method for discovering interpretable circuits by training Transformer models to have sparse weights. The resulting models contain high-quality circuits that are global, fine-grained, and often simple enough to draw in their entirety on a whiteboard. Although sparse language models trained this way are expensive, the main promise of the method lies in eventually training a fully interpretable moderate-sized model, which could aid in developing a theory of cognition. Preliminary results also suggest the method can be used to explain existing dense models.

💡 **Sparse weights and interpretable circuits:** The post proposes training Transformers while constraining most of their weights to be zero, with the aim of producing highly interpretable "circuits." This sparsity means each neuron connects to only a few other nodes, which simplifies the model's internal logic and makes it easier for humans to understand (see the sketch after these highlights).

🌟 **Circuit quality and granularity:** The circuits produced by this method have notable advantages: they are global (not tied to specific datapoints), fine-grained (resolved down to individual neurons and attention channels rather than entire MLP layers or attention heads), and concise (often simple enough to draw in their entirety on a whiteboard). This allows a deeper and more intuitive understanding of the model's internal mechanisms.

💰 **Cost and outlook:** Although training sparse language models from scratch with this method is very expensive, making it unlikely to be used to pretrain frontier models directly, its main value lies in eventually scaling the approach to train a fully interpretable moderate-sized model. The research also presents preliminary results on applying the method to explain existing dense models.

📊 **Capability-interpretability trade-off:** The research finds that making weights sparser sacrifices some model capability, while scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge.
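
The weight-sparsity idea in the first highlight can be illustrated with a minimal sketch. The snippet below is an assumption, not the authors' training procedure: it reapplies a simple top-k magnitude mask after each optimizer step (the keep fraction, toy MLP, and placeholder objective are all illustrative), showing how constraining most weights to zero leaves each neuron with only a few connections.

```python
# Minimal sketch (assumed, not the paper's exact method): keep only the
# largest-magnitude weights of each linear layer after every optimizer step,
# so each neuron ends up with only a few nonzero connections.
import torch
import torch.nn as nn

def apply_topk_mask_(linear: nn.Linear, keep_fraction: float = 0.01) -> None:
    """Zero out all but the top `keep_fraction` of weights by magnitude, in place."""
    with torch.no_grad():
        w = linear.weight
        k = max(1, int(keep_fraction * w.numel()))
        threshold = torch.topk(w.abs().flatten(), k).values.min()
        w.mul_((w.abs() >= threshold).to(w.dtype))

# Toy usage: one MLP block trained for a single step, then sparsified to ~1% nonzeros.
mlp = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

x = torch.randn(32, 256)
loss = mlp(x).pow(2).mean()            # placeholder objective, not a real LM loss
loss.backward()
opt.step()
for layer in mlp:
    if isinstance(layer, nn.Linear):
        apply_topk_mask_(layer)

nonzeros = sum((l.weight != 0).sum().item() for l in mlp if isinstance(l, nn.Linear))
print("nonzero weights remaining:", nonzeros)
```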

Published on November 13, 2025 6:30 PM GMT

TL;DR: We develop a novel method for finding interpretable circuits in Transformers, by training them to have sparse weights. This results in models that contain very high quality circuits: our circuits are global rather than datapoint dependent; we explain the circuit down to very granular objects, like individual neurons and attention channels, rather than entire MLP layers, attention heads, or groups of nodes; and the circuits are often simple enough to draw in their entirety on a whiteboard. The downside is that our method produces de novo sparse language models, which are extremely expensive to train and deploy, making it unlikely that we will ever be able to use this method to directly pretrain frontier models. We share preliminary results on using sparse models to explain an existing dense model, but our main theory of impact is to eventually scale our method to train a fully interpretable moderate-sized model. If we could fully interpret even (say) a GPT-3 level intelligence, it could aid dramatically in developing a theory of cognition in general.

[Blog] [Paper] [Code]

Abstract

Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting that our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.
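
To make the pruning step in the abstract concrete, here is a hedged sketch of one way to isolate the part of a model responsible for a task: greedily ablate hidden units and keep only those whose removal degrades task loss beyond a tolerance. The toy model, the `task_loss` helper, and the 10% tolerance are illustrative assumptions, not the paper's actual procedure.

```python
# Hedged sketch of task-circuit pruning (illustrative, not the authors' code):
# greedily ablate hidden units of a toy model and keep a unit only if removing
# it pushes the task loss past a tolerance.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(128, 16), torch.randn(128, 1)   # stand-in "task" data

def task_loss(mask: torch.Tensor) -> float:
    """Task loss with hidden units outside `mask` ablated (zeroed)."""
    with torch.no_grad():
        h = torch.relu(model[0](x)) * mask          # broadcast mask over the batch
        return nn.functional.mse_loss(model[2](h), y).item()

baseline = task_loss(torch.ones(64))
tolerance = 1.1 * baseline                          # allow 10% degradation (assumed)
mask = torch.ones(64)
for i in range(64):                                 # one greedy pass over hidden units
    trial = mask.clone()
    trial[i] = 0.0
    if task_loss(trial) <= tolerance:
        mask = trial                                # unit i is not needed for the task

circuit = mask.nonzero().flatten().tolist()
print(f"{len(circuit)} of 64 hidden units retained in the circuit")
```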



