Alternative Models of Superposition

This post examines how "Toy Models of Superposition" (TMS) represents features and challenges some common assumptions. Using a modified toy model that removes the ReLU non-linearity and adopts a loss function that reconstructs only active features, the researchers compress dozens of features into a two-dimensional space. This finding shows that, under certain conditions, a model can place features in superposition even without a non-linear activation function, and that the number of features is not necessarily limited by the number of "almost-orthogonal" vectors the space can hold. The results highlight that loss-function design and modeling assumptions shape when superposition occurs, and they encourage the study of a more diverse set of toy models for a fuller understanding.

💡 Toy Models of Superposition (TMS) shows that models can represent more features than they have dimensions. The original model represents 5 features in a rank-2 matrix, raising the question of where the upper limit on feature representation lies.

🚀 By removing the ReLU non-linearity from the TMS model and introducing a loss that reconstructs only active features, this post represents dozens of features in two-dimensional space, challenging the assumption that the number of features is limited by the number of almost-orthogonal vectors in the space.

🧠 The experiments find that an element-wise non-linearity is not a necessary condition for superposition. Under a suitable loss design (one that attends only to the target feature), even a linear model can compress and represent features effectively, in contrast to TMS's claim that superposition does not occur in linear models.

📊 The results show that adjusting the sparsity of the training data (i.e., the probability that non-target features appear) affects the model's ability to represent features. Higher sparsity improves the model's classification accuracy, though high interference costs remain.

⚠️ The post stresses the importance of studying a more diverse set of toy models, since different modeling assumptions and design choices (such as the loss function) produce sharply different results. Researchers should be cautious about generalizing conclusions drawn from any one toy model.

Published on August 11, 2025 3:52 PM GMT

Zephaniah Roe (mentee) and Rick Goldstein (mentor) conducted these experiments during continued work following the SPAR Spring 2025 cohort.

Disclaimer / Epistemic status: We spent roughly 30 hours on this post. We are not confident in these findings but we think they are interesting and worth sharing.

We assume some basic familiarity with the superposition hypothesis from Elhage et al., 2022, but have an additional "Preliminaries" section below to provide background.

Summary

Toy Models of Superposition (2022)—which we call TMS for short—demonstrates that a toy autoencoder can represent 5 features in a rank 2 matrix. In other words, the paper shows that models can represent more features than they have dimensions. While the original model proposed in the paper has since been extensively studied, there are alternative toy models which may give different results. In this post, we use a loss function that focuses only on reconstructing active features and find that this variation allows us to squeeze dozens of features into two dimensions without a non-linearity. 

100 feature directions in two-dimensional space

We do not claim to find anything extraordinary or groundbreaking. Rather, we attempt to challenge two potential assumptions about superposition that we don't believe hold universally:

1. Superposition requires an element-wise non-linearity; it does not occur in purely linear models.
2. The number of features a model can represent is limited by the number of almost-orthogonal[1] vectors that fit in its hidden space.

In this post, we show empirically that neither of these claims is universally true by designing toy models where they no longer hold.

Preliminaries

TMS introduces a model that, despite consisting of only 15 trainable parameters, is intriguingly complex. The model is defined by $\hat{x} = \mathrm{ReLU}(W^\top W x + b)$, where $W \in \mathbb{R}^{2 \times 5}$ and $b \in \mathbb{R}^{5}$. The authors show that when $x$ is sparse enough (i.e., the probability that $x_i = 0$ is high), the model can represent five features even though the hidden layer has only two dimensions.
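For concreteness, here is a minimal PyTorch sketch of this architecture (the initialization scale is our own assumption; the $\mathrm{ReLU}(W^\top W x + b)$ structure is from TMS):

    import torch
    import torch.nn as nn

    class TMSModel(nn.Module):
        """TMS toy autoencoder: x_hat = ReLU(W^T W x + b); 2*5 + 5 = 15 parameters."""
        def __init__(self, n_features: int = 5, n_hidden: int = 2):
            super().__init__()
            self.W = nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
            self.b = nn.Parameter(torch.zeros(n_features))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = x @ self.W.T                         # project 5 features into 2-D hidden space
            return torch.relu(h @ self.W + self.b)   # reconstruct all 5 features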

When sparsity is sufficiently high, the probability that more than one input feature will be present is extremely low. If there is a 5% chance that each $x_i$ is non-zero, there is a ~2% chance that more than one feature will be active but a ~20% chance that exactly one feature will be present. This means that the vast majority of non-zero inputs to the model will have a single feature present.
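A quick sanity check of these numbers (a sketch; only the 5% activation probability from the example above is assumed):

    from math import comb

    p, n = 0.05, 5
    p_none = (1 - p) ** n                        # ~0.774: no features active
    p_one = comb(n, 1) * p * (1 - p) ** (n - 1)  # ~0.204: exactly one active
    p_multi = 1 - p_none - p_one                 # ~0.023: more than one active
    print(p_one, p_multi)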

Consider the input $x = (0, 1, 0, 0, 0)$. Calculating $Wx$ gives us the second column of the $W$ matrix. Then, when we take $W^\top W x$, we are computing the dot product of the second column with every other column in $W$. Assuming that the vector length of each column in $W$ is the same, the most active output will be the second feature (this is just a property of the dot product). The same logic applies if the non-zero feature is any positive value.
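This is easy to verify numerically (a sketch with randomly chosen, equal-norm columns of our own, ignoring the bias):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 5))
    W /= np.linalg.norm(W, axis=0)           # give every column unit length

    x = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # one-hot second feature
    out = W.T @ (W @ x)                      # dot product of column 2 with every column
    assert out.argmax() == 1                 # column 2's dot product with itself
                                             # (its squared norm) is the largest entry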

When you train the original model from TMS, it does not always represent all five features, but when it does, the results can be remarkably clean. In the figure below, we show the five columns of the weight matrix, each as a direction in two-dimensional space. We assign a color to each feature. For example, the direction of the second feature's weights is represented as a blue arrow. Areas in the hidden layer that result in the second feature having the highest output are shown in a lighter blue. Finally, the dark blue dot marks the location in the hidden space that $x = (0, 1, 0, 0, 0)$ maps to.

Five features in superposition. This is similar to a figure in the introduction of Elhage et al., 2022 

If you view this toy model as a kind of classifier, then the colored areas above represent classification boundaries for the model's latent space.

Role of Non-linearities

When the toy model described above represents features in superposition, there is the opportunity for interference. Concretely, when we pass $x = (0, 1, 0, 0, 0)$ to the model, we won't get a one-hot vector back. The second output will be the highest, but the other outputs will take on other positive or negative values.

The negative output values correspond to features with weights that point in an opposite direction from the second feature in vector space. In TMS, the model is trained with a mean squared error loss, meaning these negative values are quite costly when they should be zero. The ReLU non-linearity conveniently filters these negative values, allowing for non-orthogonal representation of features without high reconstruction loss. In the linear case, however, the interference cannot be eliminated so superposition is extremely costly. We claim that for the TMS model described above, the ReLU doesn’t do computation as much as it makes interference less punishing. In other words, the model doesn't use the ReLU to learn a non-linear rule. Rather, it removes the negative values which are especially punishing to the reconstruction loss. 

This suggests we don't need the ReLU at all: ReLUs help bring loss closer to 0, but don't help identify the maximum feature.[2] Any training objective that values accurate representation of active features over tolerating noise from non-orthogonal representations should represent features in superposition.
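A small illustration of this claim (our own made-up weights, with no bias term for simplicity): the linear reconstruction of a one-hot input contains negative interference terms that inflate the MSE, while a ReLU zeroes them without changing which output is largest.

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(2, 5))
    W /= np.linalg.norm(W, axis=0)

    x = np.eye(5)[1]                    # one-hot second feature
    linear = W.T @ (W @ x)              # contains negative interference terms
    filtered = np.maximum(linear, 0.0)  # ReLU removes the negative values

    mse = lambda y: np.mean((y - x) ** 2)
    print(mse(linear), mse(filtered))            # filtered MSE is never higher
    assert linear.argmax() == filtered.argmax()  # but the max feature is unchanged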

Alternative Model of Superposition

We show that you can have superposition with no element-wise non-linearity and a different loss. We change the original TMS setup to have no ReLU (so $\hat{x} = W^\top W x + b$) and more features. We initialize $W$ and $b$ such that the model takes 100 inputs rather than 5. During training, every example includes a single target feature $t$, where the target input $x_t$ is sampled from a uniform distribution between 0 and 1. We let there be a probability $1-p$ that every other feature will also be present. Non-target features which are selected are sampled uniformly between 0 and 0.1. Next, we change the reconstruction loss to only include the target class $t$ rather than the entire output:[3]

$L = (x_t - \hat{x}_t)^2$
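A minimal training sketch of this setup (batch size, optimizer, learning rate, initialization scale, and step count below are our own guesses; the post does not specify them):

    import torch

    n_features, n_hidden, p = 100, 2, 0.95
    W = (0.1 * torch.randn(n_hidden, n_features)).requires_grad_()
    b = torch.zeros(n_features, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=1e-3)

    for step in range(50_000):
        batch = 1024
        t = torch.randint(n_features, (batch,))           # one target feature per example
        x = torch.rand(batch, n_features) * 0.1           # non-target values in [0, 0.1)
        x *= (torch.rand(batch, n_features) > p).float()  # each present with probability 1-p
        x[torch.arange(batch), t] = torch.rand(batch)     # target value in [0, 1)

        out = (x @ W.T) @ W + b                           # purely linear model, no ReLU
        loss = ((out[torch.arange(batch), t] -
                 x[torch.arange(batch), t]) ** 2).mean()  # loss on the target class only
        opt.zero_grad(); loss.backward(); opt.step()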

Note that this gives us a model that appears to be entirely linear. The "magic" here is the ability to select only a single term in the loss, but this is not a non-linearity in the traditional sense. Another perspective is that the loss above includes an implied selection or maximum at the end of the network that filters the noise from superposition and serves as a covert non-linearity. We believe it is still appropriate to call this model "linear" but note that the loss is doing some heavy lifting.

If we let $p = 0.95$, there is more than a 99% chance that at least one of the 99 non-target features will be nonzero ($1 - 0.95^{99} \approx 0.994$). This means that although the loss focuses on the target input, the model still has to handle the interference cost of non-target input features. In this setup, despite having some interference on almost all training examples, the model does a reasonably good job of representing some direction for all 100 features:

Toy model with 100 dimensions and p=0.95

We evaluate the classification accuracy of the model by taking a one-hot vector for each input feature and feeding it to the model. If the top output is the same class as the target feature, we say the model represents the target "accurately." In the above figure, the model represents 31 classes accurately. Incorrect classifications happen when there is a weight vector that is adjacent in the vector space but is slightly longer than the weight vector for the target feature.
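In code, the evaluation looks roughly like this (a sketch continuing the training snippet above):

    with torch.no_grad():
        eye = torch.eye(n_features)          # one one-hot input per feature
        out = (eye @ W.T) @ W + b
        n_accurate = (out.argmax(dim=1) == torch.arange(n_features)).sum().item()
        print(f"{n_accurate} of {n_features} features represented accurately")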

By increasing sparsity, you can increase the classification accuracy of the model. At higher values of $p$, it is fairly easy to accurately represent more than 50 features. However, when the non-target feature probability $1-p$ is larger, it is much harder to get results as clean as the example shown above.

Implications

Superposition, Memorization, and Double Descent contains some figures that appear similar to those in this post, but they are different in important ways. The paper shows that under some conditions, models can memorize arbitrary numbers of data points (relevant code from the paper here). Our experiments, however, focus on representations of features rather than datapoints. Our setup is designed such that there is some interference on all training examples and, in some cases, gradients will lead the model in the wrong direction. Despite this, the model learns the underlying structure of the data.

Our findings may have implications for techniques that decompose latent activations (e.g., Bricken et al., 2023) or weights (e.g., Bushnaq et al., 2025). These techniques face the potential issue of not being able to safely determine the maximum number of features, circuits, etc. to look for. 

The experiment also highlights the importance of studying a more diverse set of toy models. Slightly different assumptions produce wildly different models, so it is important to track which design choices are relevant to reproduce which results. In our example, we show that non-linearities are essential to superposition only given certain assumptions about the loss function. Researchers should be cautious when claiming results from toy models hold generally.

Once again, these experiments were done quickly and don't reflect the rigor of a paper or preprint. We are happy to be proven wrong by those who have thought deeply about this topic.

Zephaniah Roe (mentee) and Rick Goldstein (mentor) conducted this experiment following the SPAR Spring 2025 cohort. Rick suggested the heatmap-style explanation for toy models which was essential for making the insights in this post. Zephaniah conducted the experiments, made key insights and wrote this post.

  1. ^

    For a concrete definition of "almost-orthogonal," see the Superposition Hypothesis section of TMS. 

  2. ^

In a deep network, bringing activations to 0 can remove interference, so we don't claim that ReLUs are always unnecessary. In fact, ReLUs almost certainly learn non-linear rules in addition to filtering noise.

  3. ^

This loss is admittedly less conventional than a traditional mean squared error loss. We do note that losses like cross entropy use only the target class probability, which was part of our motivation. Some of the Carlini–Wagner losses do something somewhat analogous as well.


