Alternative Models of Superposition

This post examines how "Toy Models of Superposition" (TMS) represents features and challenges some common assumptions. Using a modified toy model that removes the ReLU non-linearity and adopts a loss function that reconstructs only active features, the researchers compress dozens of features into a two-dimensional space. This finding shows that, under certain conditions, a model can place features in superposition even without a non-linear activation function, and that the number of features is not necessarily limited by the number of "almost-orthogonal" vectors the space can hold. The results highlight that loss-function design and modeling assumptions shape when superposition occurs, and they encourage the study of a more diverse set of toy models for a fuller understanding.

💡 Toy Models of Superposition (TMS) shows that models can represent more features than they have dimensions. The original model represents 5 features in a rank-2 matrix, raising the question of where the upper limit on feature representation lies.

🚀 By removing the ReLU non-linearity from the TMS model and introducing a loss that reconstructs only active features, this post represents dozens of features in two-dimensional space, challenging the assumption that the number of features is limited by the number of almost-orthogonal vectors in the space.

🧠 The experiments find that an element-wise non-linearity is not a necessary condition for superposition. Under a suitable loss design (one that attends only to the target feature), even a linear model can compress and represent features effectively, in contrast to TMS's claim that superposition does not occur in linear models.

📊 The results show that adjusting the sparsity of the training data (i.e., the probability that non-target features appear) affects the model's ability to represent features. Higher sparsity improves the model's classification accuracy, though high interference costs remain.

⚠️ The post stresses the importance of studying a more diverse set of toy models, since different modeling assumptions and design choices (such as the loss function) produce sharply different results. Researchers should be cautious about generalizing conclusions drawn from any one toy model.

Published on August 11, 2025 3:52 PM GMT

Zephaniah Roe (mentee) and Rick Goldstein (mentor) conducted these experiments during continued work following the SPAR Spring 2025 cohort.

Disclaimer / Epistemic status: We spent roughly 30 hours on this post. We are not confident in these findings but we think they are interesting and worth sharing.

We assume some basic familiarity with the superposition hypothesis from Elhage et al., 2022, but have an additional "Preliminaries" section below to provide background.

Summary

Toy Models of Superposition (2022)—which we call TMS for short—demonstrates that a toy autoencoder can represent 5 features in a rank 2 matrix. In other words, the paper shows that models can represent more features than they have dimensions. While the original model proposed in the paper has since been extensively studied, there are alternative toy models which may give different results. In this post, we use a loss function that focuses only on reconstructing active features and find that this variation allows us to squeeze dozens of features into two dimensions without a non-linearity. 

100 feature directions in two-dimensional space

We do not claim to find anything extraordinary or groundbreaking. Rather, we attempt to challenge two potential assumptions about superposition that we don't believe hold universally:

1. Superposition requires an element-wise non-linearity; it does not occur in purely linear models.
2. The number of features a model can represent is limited by the number of almost-orthogonal[1] vectors that fit in its hidden space.

In this post, we show empirically that neither of these claims is universally true by designing toy models where they no longer hold.

Preliminaries

TMS introduces a model that, despite consisting of only 15 trainable parameters, is intriguingly complex. The model is defined by $\hat{x} = \mathrm{ReLU}(W^\top W x + b)$, where $W \in \mathbb{R}^{2 \times 5}$ and $b \in \mathbb{R}^{5}$. The authors show that when $x$ is sparse enough (i.e., the probability that $x_i = 0$ is high), the model can represent five features even though the hidden layer has only two dimensions.
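For concreteness, here is a minimal PyTorch sketch of this architecture (the initialization scale is our own assumption; the $\mathrm{ReLU}(W^\top W x + b)$ structure is from TMS):

    import torch
    import torch.nn as nn

    class TMSModel(nn.Module):
        """TMS toy autoencoder: x_hat = ReLU(W^T W x + b); 2*5 + 5 = 15 parameters."""
        def __init__(self, n_features: int = 5, n_hidden: int = 2):
            super().__init__()
            self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
            self.b = nn.Parameter(torch.zeros(n_features))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = x @ self.W.T                         # project 5 features into 2-D hidden space
            return torch.relu(h @ self.W + self.b)   # reconstruct all 5 features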

When sparsity is sufficiently high, the probability that more than one input feature will be present is extremely low. If there is a 5% chance that each $x_i$ is non-zero, there is a ~2% chance that more than one feature will be active but a ~20% chance that exactly one feature will be present. This means that the vast majority of non-zero inputs to the model will have a single feature present.
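A quick sanity check of these numbers (a sketch; only the 5% activation probability from the example above is assumed):

    from math import comb

    p, n = 0.05, 5
    p_none = (1 - p) ** n                        # ~0.774: no features active
    p_one = comb(n, 1) * p * (1 - p) ** (n - 1)  # ~0.204: exactly one active
    p_multi = 1 - p_none - p_one                 # ~0.023: more than one active
    print(p_one, p_multi)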

Consider the input $x = (0, 1, 0, 0, 0)$. Calculating $Wx$ gives us the second column of the $W$ matrix. Then, when we take $W^\top W x$, we are computing the dot product of the second column with every other column in $W$. Assuming that the vector length of each column in $W$ is the same, the most active output will be the second feature (this is just a property of the dot product). The same logic applies if the non-zero feature is any positive value.
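This is easy to verify numerically (a sketch with randomly chosen, equal-norm columns of our own, ignoring the bias):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 5))
    W /= np.linalg.norm(W, axis=0)           # give every column unit length

    x = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # one-hot second feature
    out = W.T @ (W @ x)                      # dot product of column 2 with every column
    assert out.argmax() == 1                 # column 2's dot product with itself
                                             # (its squared norm) is the largest entry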

When you train the original model from TMS, it does not always represent all five features, but when it does, the results can be remarkably clean. In the figure below, we show the five columns of the weight matrix, each as a direction in two-dimensional space. We assign a color to each feature. For example, the direction of the second feature's weights is represented as a blue arrow. Areas in the hidden layer that result in the second feature having the highest output are shown in a lighter blue. Finally, the dark blue dot marks the location in the hidden space that $x = (0, 1, 0, 0, 0)$ maps to.

Five features in superposition. This is similar to a figure in the introduction of Elhage et al., 2022 

If you view this toy model as a kind of classifier, then the colored areas above represent classification boundaries for the model's latent space.

Role of Non-linearities

When the toy model described above represents features in superposition, there is the opportunity for interference. Concretely, when we pass $x = (0, 1, 0, 0, 0)$ to the model, we won't get a one-hot vector back. The second output will be the highest, but the other outputs will take on other positive or negative values.

The negative output values correspond to features with weights that point in an opposite direction from the second feature in vector space. In TMS, the model is trained with a mean squared error loss, meaning these negative values are quite costly when they should be zero. The ReLU non-linearity conveniently filters these negative values, allowing for non-orthogonal representation of features without high reconstruction loss. In the linear case, however, the interference cannot be eliminated so superposition is extremely costly. We claim that for the TMS model described above, the ReLU doesn’t do computation as much as it makes interference less punishing. In other words, the model doesn't use the ReLU to learn a non-linear rule. Rather, it removes the negative values which are especially punishing to the reconstruction loss. 

This suggests we don't need the ReLU at all: ReLUs help bring loss closer to 0, but don't help identify the maximum feature.[2] Any training objective that values accurate representation of active features over tolerating noise from non-orthogonal representations should represent features in superposition.
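A small illustration of this claim (our own made-up weights, with no bias term for simplicity): the linear reconstruction of a one-hot input contains negative interference terms that inflate the MSE, while a ReLU zeroes them without changing which output is largest.

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(2, 5))
    W /= np.linalg.norm(W, axis=0)

    x = np.eye(5)[1]                    # one-hot second feature
    linear = W.T @ (W @ x)              # contains negative interference terms
    filtered = np.maximum(linear, 0.0)  # ReLU removes the negative values

    mse = lambda y: np.mean((y - x) ** 2)
    print(mse(linear), mse(filtered))            # filtered MSE is never higher
    assert linear.argmax() == filtered.argmax()  # but the max feature is unchanged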

Alternative Model of Superposition

We show that you can have superposition with no element-wise non-linearity and a different loss. We change the original TMS setup to have no ReLU (so $\hat{x} = W^\top W x + b$) and more features. We initialize $W$ and $b$ such that the model takes 100 inputs rather than 5. During training, every example includes a single target feature $t$, where the target input $x_t$ is sampled from a uniform distribution between 0 and 1. We let there be a probability $1-p$ that every other feature will also be present. Non-target features which are selected are sampled uniformly between 0 and 0.1. Next, we change the reconstruction loss to only include the target class $t$ rather than the entire output:[3]

$L = (x_t - \hat{x}_t)^2$
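A minimal training sketch of this setup (batch size, optimizer, learning rate, initialization scale, and step count below are our own guesses; the post does not specify them):

    import torch

    n_features, n_hidden, p = 100, 2, 0.95
    W = (0.1 * torch.randn(n_hidden, n_features)).requires_grad_()
    b = torch.zeros(n_features, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=1e-3)

    for step in range(50_000):
        batch = 1024
        t = torch.randint(n_features, (batch,))           # one target feature per example
        x = torch.rand(batch, n_features) * 0.1           # non-target values in [0, 0.1)
        x *= (torch.rand(batch, n_features) > p).float()  # each present with probability 1-p
        x[torch.arange(batch), t] = torch.rand(batch)     # target value in [0, 1)

        out = (x @ W.T) @ W + b                           # purely linear model, no ReLU
        loss = ((out[torch.arange(batch), t] -
                 x[torch.arange(batch), t]) ** 2).mean()  # loss on the target class only
        opt.zero_grad(); loss.backward(); opt.step()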

Note that this gives us a model that appears to be entirely linear. The "magic" here is the ability to select only a single term in the loss, but this is not a non-linearity in the traditional sense. Another perspective is that the loss above includes an implied selection or maximum at the end of the network that filters the noise from superposition and serves as a covert non-linearity. We believe it is still appropriate to call this model "linear" but note that the loss is doing some heavy lifting.

If we let $p = 0.95$, there is more than a 99% chance that at least one of the 99 non-target features will be nonzero ($1 - 0.95^{99} \approx 0.994$). This means that although the loss focuses on the target input, the model still has to handle the interference cost of non-target input features. In this setup, despite having some interference on almost all training examples, the model does a reasonably good job of representing some direction for all 100 features:

Toy model with 100 dimensions and p=0.95

We evaluate the classification accuracy of the model by taking a one-hot vector for each input feature and feeding it to the model. If the top output is the same class as the target feature, we say the model represents the target "accurately." In the above figure, the model represents 31 classes accurately. Incorrect classifications happen when there is a weight vector that is adjacent in the vector space but is slightly longer than the weight vector for the target feature.
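In code, the evaluation looks roughly like this (a sketch continuing the training snippet above):

    with torch.no_grad():
        eye = torch.eye(n_features)          # one one-hot input per feature
        out = (eye @ W.T) @ W + b
        n_accurate = (out.argmax(dim=1) == torch.arange(n_features)).sum().item()
        print(f"{n_accurate} of {n_features} features represented accurately")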

By increasing sparsity, you can increase the classification accuracy of the model. At higher values of $p$, it is fairly easy to accurately represent more than 50 features. However, when the non-target feature probability $1-p$ is larger, it is much harder to get results as clean as the example shown above.

Implications

Superposition, Memorization, and Double Descent contains some figures that appear similar to those in this post, but they are different in important ways. The paper shows that under some conditions, models can memorize arbitrary numbers of data points (relevant code from the paper here). Our experiments, however, focus on representations of features rather than datapoints. Our setup is designed such that there is some interference on all training examples and, in some cases, gradients will lead the model in the wrong direction. Despite this, the model learns the underlying structure of the data.

Our findings may have implications for techniques that decompose latent activations (e.g., Bricken et al., 2023) or weights (e.g., Bushnaq et al., 2025). These techniques face the potential issue of not being able to safely determine the maximum number of features, circuits, etc. to look for. 

The experiment also highlights the importance of studying a more diverse set of toy models. Slightly different assumptions produce wildly different models, so it is important to track which design choices are relevant to reproduce which results. In our example, we show that non-linearities are essential to superposition only given certain assumptions about the loss function. Researchers should be cautious when claiming results from toy models hold generally.

Once again, these experiments were done quickly and don't reflect the rigor of a paper or preprint. We are happy to be proven wrong by those who have thought deeply about this topic.

Zephaniah Roe (mentee) and Rick Goldstein (mentor) conducted this experiment following the SPAR Spring 2025 cohort. Rick suggested the heatmap-style explanation for toy models which was essential for making the insights in this post. Zephaniah conducted the experiments, made key insights and wrote this post.

  1. ^

    For a concrete definition of "almost-orthogonal," see the Superposition Hypothesis section of TMS. 

  2. ^

In a deep network, bringing activations to 0 can remove interference, so we don't claim that ReLUs are always unnecessary. In fact, ReLUs almost certainly learn non-linear rules in addition to filtering noise.

  3. ^

This loss is admittedly less conventional than a traditional mean squared error loss. We do note that losses like cross entropy use only the target class probability, which was part of our motivation. Some of the Carlini–Wagner losses do something somewhat analogous as well.


