Investigating Information Encoding and Decoding Mechanisms in Language Models

This study examines key mechanisms by which encoder-decoder models handle linguistic information. Through a series of experiments, it focuses on how the model compresses text into the bottleneck embedding, and on how modifying the input text or the embedding affects the decoded output. The study finds that although the model translates efficiently across languages, its embeddings still retain source-language and positional information, and this information matters for decoding. In addition, the model tends to produce grammatically correct text when decoding, even when the embedding has been perturbed. These findings offer a valuable perspective on the inner workings of large language models (LLMs).

🌐 **Multi-faceted encoding of language information in the embedding**: The study shows that the bottleneck embedding of an encoder-decoder model stores not only the semantic content of the text but also traces of the source language and the positions of tokens in the sequence. Even in cross-lingual translation tasks, the source language is encoded in a recognisable way, and positional information is important for reconstructing sequences, especially non-semantic ones.

🔄 **Decoder robustness and preferences**: Experiments show that the decoder is highly robust. Even when the input embedding is altered with respect to language or position, the decoder can still, to some extent, generate meaningful text. Notably, the decoder tends to produce output that follows the grammar of the target language, even if this means deviating from the literal content indicated by the manipulated embedding, revealing an inherent preference for generating valid text.

🔬 **Model interpretability and the analogy to LLMs**: By visualising the embedding space and performing vector arithmetic, the study finds that certain directions in the embedding may independently encode specific concepts, analogous to the residual stream of large language models (LLMs). This structural similarity suggests that studying these simpler encoder-decoder models can provide useful reference points for understanding the more complex dynamics of LLMs.

🧰 **Limitations and future directions**: Although the study yields some initial results, the author acknowledges the limited sample size and language diversity of the experiments. Future work could extend to more language pairs, more complex text, and deeper mathematical analysis, in order to validate these findings more thoroughly and to explore finer-grained embedding manipulation techniques.

Published on September 6, 2025 8:07 PM GMT

Introduction and Motivation

As part of the 5th iteration of the ARENA AI safety research program we undertook a capstone project. For this project we ran a range of experiments on an encoder-decoder model, focusing on how information is stored in the bottleneck embedding and how modifying the input text or embedding impacts the decoded text. Some of the findings are shown in this blog post, and the rest in the companion post by @Samuel Nellessen here. We helped each other throughout, but most of his work is documented in his post, and this post documents my findings. This work was also supported by @NickyP and was in part motivated by this.

Although created for the purposes of language translation and not complex reasoning, this type of model is interesting to study for a few reasons:

The Model

Meta's SONAR model is a multilingual text and speech encoder and text decoder trained for translation and speech-to-text purposes. In this study we look only at the text encoding and decoding, and ignore the multi-modal capabilities of the model.

The model is composed of:

Showing an example of an encoder-decoder model. Image modified from here

Both the encoder and the decoder are given a language token: the encoder uses it to properly store the input information in the embedding, and the decoder uses it to translate the meaning of the input text into the target language.
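To make this concrete, here is a minimal sketch of encoding and decoding text with SONAR's Python pipelines. The pipeline classes, checkpoint names, and language codes follow the public facebookresearch/SONAR README as I understand it, not the code behind this post; treat them as assumptions that may differ between library versions. Later sketches reuse the `encoder` and `decoder` objects defined here.

```python
# A minimal sketch of SONAR text encoding/decoding. Pipeline and checkpoint names
# follow the facebookresearch/SONAR README and may differ between versions.
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
decoder = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder",
    tokenizer="text_sonar_basic_encoder",
)

# The language codes play the role of the language tokens described above:
# one tells the encoder the input language, the other tells the decoder
# which language to generate.
embeddings = encoder.predict(["The dog is happy."], source_lang="eng_Latn")
print(decoder.predict(embeddings, target_lang="spa_Latn", max_seq_len=64))
```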

Investigations

How Does the Input Language Impact the Bottleneck Embedding

Motivation

When encoding some text we begin the sequence with a language token to tell the encoder which language is being input, and when decoding we do the same to tell the decoder which language to generate the output in. In this way we can use the same model to encode/decode from/to any of the large choice of languages (see here).

An open question was how the choice of input language impacts the embedding: is the embedding language-agnostic? We might expect one of the following:

Experiment

For this experiment we take about 300 English sentences and their Spanish translations. We embed each sample in both languages, giving two embeddings per sentence (one for each language). Using these we can then run a number of quick experiments. Throughout, we refer to an "English embedding" as an embedding generated from English input text, regardless of whether that information is stored in the embedding (and similarly for "Spanish embedding"). Note that decoding an English embedding back into English seems to always reproduce the same text word for word (and similarly for Spanish embeddings decoded into Spanish).

Showing the first two principal components of the embeddings. There is a clear but small separation between the embeddings from English and Spanish. Grey lines connect each English embedding to the corresponding Spanish embedding. The green arrow points from the centroid of the English embeddings to the centroid of the Spanish embeddings.
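As an illustration of how this comparison could be set up, here is a hedged sketch of embedding paired sentences and computing the centroid direction shown in the figure. The sentence lists are placeholders rather than the ~300 pairs actually used, and the `encoder` comes from the earlier sketch.

```python
# A sketch of the language-separation experiment: embed paired English/Spanish
# sentences, project with PCA, and compute the English -> Spanish centroid direction.
import numpy as np
from sklearn.decomposition import PCA

english = ["The dog is happy.", "I like tea."]         # placeholder data
spanish = ["El perro está feliz.", "Me gusta el té."]  # placeholder data

en_emb = encoder.predict(english, source_lang="eng_Latn").cpu().numpy()
es_emb = encoder.predict(spanish, source_lang="spa_Latn").cpu().numpy()

# Fit PCA on both sets together so they share the same 2D projection.
pca = PCA(n_components=2).fit(np.vstack([en_emb, es_emb]))
en_2d, es_2d = pca.transform(en_emb), pca.transform(es_emb)

# Direction from the English centroid to the Spanish centroid (the green arrow).
lang_direction = es_emb.mean(axis=0) - en_emb.mean(axis=0)
```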

Limitations and Further Work

This was a quick initial experiment that could do with deeper analysis. Some limitations and open questions:

Conclusion

It appears that the source language information is encoded in the bottleneck embedding, although why is unclear. It could be an artefact of the encoder that has no impact, or maybe it is useful for the decoder in situations that weren't covered by the examples we tested. The main takeaway is that the source language does impact the embedding, and that this information could be used by the decoder if needed.

How is Positional Information Stored in the Embedding

Motivation

Can we learn something about how positional information (the position of a token in the input sequence) is encoded? Encoder-decoder models work well with semantically meaningful sequences, but they can also embed and decode other types of sequences, so they must have some way of encoding a sequence that differs from how they embed meaningful language. How do they do this? Can we manipulate the embedding to control the decoder's output sequence?

Experiment

To begin with, we visualise how positional information is stored. If we take a sequence like "dog _ _ _", we can encode it to get an embedding. Note that "dog" is a single token, as is the filler token "_"; the spaces in the string are just for clarity and are not present in the input sequence. We can then shift the "dog" token along repeatedly to get input sequences "_ dog _ _", "_ _ dog _", etc., and look at how the embeddings change as we shift the token along. Plotting the embeddings for about 20 positions with PCA, we can see that the embedding vector appears to trace out a circular pattern or an arc. The distance between subsequent embeddings gets smaller as the position of the token increases - this makes sense, as for super long sequences you could imagine that the exact position matters less.

Showing the paths traced out by shifting the token position. One curve is for the token "dog", another for "cat", and another for "speak". The filler token is always "_".
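A rough sketch of how this visualisation could be reproduced is below. The "_" filler token and the exact sequence construction are my reconstruction rather than the authors' code, and the pipelines come from the earlier sketch.

```python
# A sketch of the position-path visualisation: embed "dog" shifted through ~20
# positions in a sequence of "_" filler tokens, then inspect the path with PCA.
import numpy as np
from sklearn.decomposition import PCA

n_positions = 20
sequences = [
    " ".join(["_"] * i + ["dog"] + ["_"] * (n_positions - 1 - i))
    for i in range(n_positions)
]
emb = encoder.predict(sequences, source_lang="eng_Latn").cpu().numpy()

path_2d = PCA(n_components=2).fit_transform(emb)  # consecutive points trace an arc

# Steps between adjacent positions shrink as the position increases.
step_sizes = np.linalg.norm(np.diff(emb, axis=0), axis=1)
print(step_sizes)
```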

Note that decoding works perfectly for these kinds of sequences. Testing on super long sequences (say, above 30 tokens) does begin to show some degradation, where the decoded token position is wrong, or we just get filler tokens.

We see a nice pattern here; can we get a general vector that we can use to tweak an embedding to control the decoded sequence? We take the embedding for "_ dog _ _" and subtract from it the embedding for "dog _ _ _", which hopefully gives a vector that, when added to an embedding, will shift the token in the decoded sequence one position to the right. We indeed find that adding this to the embedding for "cat _ _ _" and decoding does in fact give "_ cat _ _".
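In code, constructing and applying such a shift vector might look like the following sketch (again assuming "_" as the filler token and reusing the pipelines defined earlier).

```python
# A sketch of the position-shift vector described above, assuming "_" as the filler.
dog_pos0 = encoder.predict(["dog _ _ _"], source_lang="eng_Latn")
dog_pos1 = encoder.predict(["_ dog _ _"], source_lang="eng_Latn")
shift_right = dog_pos1 - dog_pos0  # hoped to move a token one position to the right

cat_pos0 = encoder.predict(["cat _ _ _"], source_lang="eng_Latn")
print(decoder.predict(cat_pos0 + shift_right, target_lang="eng_Latn", max_seq_len=16))
# expected, per the experiment above: ["_ cat _ _"]
```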

These vectors are not so useful, however, because they only work if the filler tokens are the same, so they don't apply to arbitrary sequences. In fact, if we change the filler token, the pattern above completely changes. There always appears to be a path traced out as we shift the tokens, and it is always arc-like, but the paths are always different. This means there isn't an obvious general way to control the token position in arbitrary sequences.

Conclusion

We can in principle see a pattern in how positional information is encoded in the embedding, and it appears that earlier positions are more important than later ones. Although the vectors that shift token positions work, they do so only in very specific situations and are probably not useful in general. It is interesting that the embeddings aren't completely arbitrary; they can in principle be inspected. The fact that a vector shifting one token works for any other token wasn't known a priori. There is definitely more room here for experimentation.

Can we Manipulate the Embedding to Arbitrarily Change Tokens

Motivation

We want to investigate how the information about particular input tokens is stored. How cleanly is the information embedded? Is it possible to manipulate the embedding to get fine-grained control over the decoded sequence?

Experiment

Here we look at changing a particular token in a sequence to another by manipulating the embedding. Can we manipulate the embedding of "dog a bit dog b" so that it decodes to "dog a bit cat b"?

We first try to do this by calculating a shifting vector: we subtract the embedding for "dog _ _ _ _" from the embedding for "cat _ _ _ _". If we add this vector to the embedding for "dog a bit dog b", it decodes to "cat a bit cat b". This is interesting: by default we get a general "dog" to "cat" direction. Further experiments showed that indeed all instances of "dog" become "cat" regardless of the input sequence; the positional information here is ignored.

Instead, if we do the same but subtract the embedding for "dog _ _ dog _" from the embedding for "dog _ _ cat _", then this does in fact change "dog" to "cat" only for the "dog" in fourth position. For example, we can apply this to "dog is happy dog now" to get "dog is happy cat now"; the "dog" in the first position is successfully left untouched. This tells us that we have captured more than just the meaning difference between "dog" and "cat": we have also captured some positional information, i.e. more evidence that positional information is stored in the embedding (as we saw in the previous section).

Interestingly, this only works for grammatically sensible substitutions. If we try the same approach to change "dog a bit dog b" to "dog a apple dog b" (i.e. by subtracting the embedding for "_ _ bit _ _" from the embedding for "_ _ apple _ _" and adding the result to the embedding for "dog a bit dog b"), it fails: it decodes to "dog a likes dog b". This implies that even though we can manipulate the embedding to change a token, and even a token at a given position, we are at the mercy of what the decoder does. It seems to have a preference for generating useful/valid text and doesn't want to put a noun where a verb should be.
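A sketch of these token-swap manipulations is below. The filler sequences are my reconstruction using "_", and the expected outputs simply mirror the examples described above; treat the exact strings as illustrative rather than the authors' code.

```python
# A sketch of the token-swap manipulations, reusing the pipelines defined earlier.
def emb(text):
    return encoder.predict([text], source_lang="eng_Latn")

def dec(embedding):
    return decoder.predict(embedding, target_lang="eng_Latn", max_seq_len=32)

# Global "dog" -> "cat" direction: every "dog" in the decoded sequence changes.
dog_to_cat = emb("cat _ _ _ _") - emb("dog _ _ _ _")
print(dec(emb("dog a bit dog b") + dog_to_cat))         # -> ["cat a bit cat b"]

# Position-specific direction: only the "dog" in fourth position becomes "cat".
dog4_to_cat4 = emb("dog _ _ cat _") - emb("dog _ _ dog _")
print(dec(emb("dog is happy dog now") + dog4_to_cat4))  # -> ["dog is happy cat now"]
```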

Conclusion

We can:

Conclusions

For experiment-specific conclusions see the conclusion sections of the individual investigations above.

Quick summary:

An overall result is that the bottleneck embedding appears to behave somewhat like the residual stream in LLMs in the sense that there seem to be orthogonal directions that carry semantic meaning. Due to this, along with their simplicity, studying these models could be useful for better understanding some LLM dynamics. 
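One simple way to probe this claim, sketched below, is to compare two of the directions found in the earlier sketches (the English-to-Spanish centroid direction and the global "dog" to "cat" direction) with cosine similarity; a value near zero would be consistent with roughly orthogonal semantic directions. This is an illustration of the idea, not a result reported in the post.

```python
# An illustrative check (not from the post): cosine similarity between the
# English -> Spanish centroid direction and the "dog" -> "cat" direction
# computed in the sketches above.
import torch
import torch.nn.functional as F

similarity = F.cosine_similarity(
    torch.as_tensor(lang_direction, dtype=torch.float32).flatten(),
    dog_to_cat.flatten().to(torch.float32),
    dim=0,
)
print(similarity.item())  # near zero would suggest roughly orthogonal directions
```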

We did a wide range of investigations to probe for interesting behaviour, and found some interesting initial results. Given the limited time we had, we couldn't dig deeper or rigorously prove our findings. Take my conclusions here with a grain of salt: I believe them to be the case based on my results, but they could just be rationalisations. This definitely could do with more work and I would be happy if someone expanded on it. For anyone interested, there is a utility class for working with the SONAR model and some scripts that you can steal from to get started on your own experiments.


