Exploring How Large Language Models Process "Self" Information

This post takes a close look at how large language models (LLMs) process information about themselves. The study analyzes attention entropy patterns in three models (Mistral, Qwen, Llama), comparing self-referent prompts (e.g., "Who are you?"), neutral fact-retrieval prompts, and control categories. The results indicate that instruction tuning substantially changes how models process self-referent language, but the direction of the change is tied to the composition of the training corpus. In the English-centric models (Mistral, Llama), attention patterns for self-referent information converge after instruction tuning, moving closer to neutral fact retrieval, while the multilingual model (Qwen) keeps the attention patterns for self-reference and neutral facts distinct. This suggests that the linguistic composition of training data may influence whether instruction tuning produces shared or specialized processing pathways. The study also proposes metrics such as the "Role-Focus Coefficient" (RFC) and "Role Sensitivity Index" (RSI), and offers a preliminary analysis of candidate attention heads.

🎯 **Effect of instruction tuning on the model's representation of "self"**: The study finds that instruction tuning systematically changes how large language models process self-referent language. However, the direction of this change is not uniform; it depends on the composition of the model's training data. Specifically, in models trained mainly on English (such as Mistral and Llama), the attention patterns for "self" information move toward the patterns for neutral fact retrieval after instruction tuning, showing a kind of "compression."

🌐 **Differences between multilingual and English-centric models**: Unlike the English-centric models, the multilingual model (Qwen) maintains distinct attention patterns for self-referent queries versus neutral fact queries after instruction tuning. This suggests that the linguistic diversity of the training data may play a key role during instruction tuning, determining whether the model forms shared or specialized processing pathways for different query types.

📊 **Measuring and analyzing the "Role-Focus Coefficient" (RFC)**: The study introduces the "Role-Focus Coefficient" (RFC) to quantify how concentrated the model's attention is on self-referent information relative to neutral information. Comparing the change in RFC (ΔRFC) between each model's base and instruct versions, the author finds that Llama and Mistral show a negative ΔRFC, meaning the attention focus on self-referent information weakens relative to neutral prompts, while Qwen shows a positive ΔRFC, indicating its focus slightly strengthens or its original distinction is preserved.

🔬 **Candidate attention heads and future directions**: The study tentatively identifies attention heads that may be involved in processing the model's "self" representation, concentrated in the layer range (layers 13-18) where Llama and Mistral show the strongest effects. Next steps include validating the function of these heads with methods such as activation patching, extending the RFC framework to other AI behaviors (such as goal-directedness and deception), and experimentally manipulating training data to induce changes in model behavior.

Published on October 22, 2025 12:44 PM GMT

My first post explored "self-talk" induction in small base LLMs. After further contemplation, I decided that first I ought to better understand how LLMs represent "self" mechanistically before examining induced "self-talk". How do language models process questions about themselves? I've started by analyzing attention entropy patterns[1] across self-referent prompts ('Who are you?'), neutral fact-retrieval prompts ('What is photosynthesis?'), and control categories in three model families. I find suggestive evidence that instruction tuning systematically changes how models process self-referent language (unsurprising), but the direction of change depends on the model's training corpus (potentially interesting). English-centric models (Mistral, Llama) show compression of layer-wise self-referent attention patterns toward neutral fact-retrieval patterns, while a multilingual model (Qwen) preserves the distinct attention patterns between questions about "self" and questions about neutral facts. This suggests training data composition may influence whether instruction tuning creates shared or specialized processing pathways for different query types. All evidence here is purely suggestive; in "Next Steps" I outline some pathways for future analysis. All code and figures are published on GitHub.

Summary and Initial Findings

I examined attention entropy patterns for self-referent prompts compared to neutral language, as well as other confounders, in Mistral 7B v0.1, Qwen 2.5 7B, and Llama 3.1 8B, comparing the base and instruct variants of each. Thus far, I've examined the entropy patterns for a sample of prompts in four categories: self-referent, third-person, neutral (fact-retrieval), and "confounders" (implied second-person prompts). The initial exploration indicates that the multilingual model (Qwen) changes differently from the English-centric models (Mistral, Llama) during the instruction-tuning phase.[2] I use an entropy-based metric I call the "Role-Focus Coefficient" (RFC), a measure of relative attention focus between self-referent prompts and neutral prompts, defined as:
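One form consistent with this description (the precise definition is an assumption here), writing $\bar{H}^{c}_\ell$ for the layer-$\ell$ attention entropy averaged over heads and over prompts of category $c$:

$$\mathrm{RFC}_\ell = \bar{H}^{\text{neutral}}_\ell - \bar{H}^{\text{self}}_\ell$$

so that a larger RFC at a layer means self-referent prompts are processed with more focused (lower-entropy) attention than neutral prompts.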

This metric is designed for quick and easy interpretation. 

I then measure the change between the instruct and base models, to understand how instruction tuning changes the attentional focus of the model:
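In symbols (using the assumed form above):

$$\Delta\mathrm{RFC}_\ell = \mathrm{RFC}^{\text{instruct}}_\ell - \mathrm{RFC}^{\text{base}}_\ell$$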

A negative ΔRFC indicates that self-referent prompts showed less attention focus relative to neutral prompts post-instruction-tuning. A positive ΔRFC indicates that self-referent prompts showed more attention focus relative to neutral prompts.

Here are some basic between-model comparisons of ΔRFC. Note that both Llama and Mistral "compress" self-reference toward neutral prompts, while Qwen slightly diverges. Also note the % of layers that "compress" versus the % of layers that preserve or remain relatively unchanged:

Finally, I developed another metric, the "Role Sensitivity Index" (RSI), which measures the relative attention focus for "confounder" prompts compared to self-referent prompts, along with the change in RSI between base and instruct models.
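By analogy with RFC (the sign convention here is an assumption):

$$\mathrm{RSI}_\ell = \bar{H}^{\text{self}}_\ell - \bar{H}^{\text{confounder}}_\ell$$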

I calculate ΔRSI the same way ΔRFC is calculated between models. RSI shows less obvious effects than RFC in this analysis; I include it in the charts for completeness and may investigate it further. In these charts you will see that Qwen shows effectively no difference in RFC between base and instruct, whereas Mistral and Llama not only show changes, but the change follows a similar pattern across both models (Pearson correlation = .398, p = .024, a statistically significant weak correlation). Early layers show little to no difference, middle layers show negative differences (compression between neutral-prompt and self-referent-prompt entropies), and late layers show little difference.

Llama family models, with a modest RFC pattern particularly in the middle layers
Mistral family models, with a stark RFC pattern, particularly in the middle layers
Qwen family models, with effectively 0 RFC pattern outside the first few layers. I believe the early layers are "language id" layers, more on that in Additional Notes.

In Mistral and Llama, most compression occurs between layers 13-18, and layers 13 and 15 both fall within the "top 5" compression layers for those models. I've done initial head-level analysis and plan to identify some candidate heads for activation patching. This is new terrain for me, so if there are recent best practices I'm unaware of, I would be excited to learn about them.
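For concreteness, here is a minimal sketch of what patching a single head's attention pattern could look like with TransformerLens hooks (the model aliases, layer, and head index are placeholders, not identified candidates):

```python
# Minimal sketch: copy one head's attention pattern from the base model into the
# instruct model's forward pass. Model aliases, LAYER, and HEAD are placeholders.
from transformer_lens import HookedTransformer, utils

base = HookedTransformer.from_pretrained("mistral-7b")
instruct = HookedTransformer.from_pretrained("mistral-7b-instruct")

LAYER, HEAD = 14, 3                      # hypothetical candidate head
tokens = instruct.to_tokens("Who are you?")
_, base_cache = base.run_with_cache(tokens)

def patch_head_pattern(pattern, hook):
    # pattern: [batch, n_heads, query_pos, key_pos]; overwrite one head's rows
    pattern[:, HEAD] = base_cache[hook.name][:, HEAD]
    return pattern

patched_logits = instruct.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("pattern", LAYER), patch_head_pattern)],
)
```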

Initial Hypotheses

Notable pattern: Qwen's overall entropy pattern (charts in Additional Notes) may indicate that it analyzes text in a specific subspace for each major language in its training corpus (Chinese and English primarily).

Hypothesis 1: Linguistic mix matters a lot in pre-training and post-instruction-tuning framings of the "self." Chinese often omits pronouns, and subjects must be inferred from context. This likely leads to different representational strategies for tracking subjects, and could drive a model trained on substantial Chinese data to maintain more focused representational subspaces to help it keep track of "self," since personal references are so often omitted in Chinese. Models with English-heavy datasets may blend or overwrite these circuits during instruction tuning because English requires less inference.

Hypothesis 2: During fine-tuning, English-based LLMs are beaten over the head with the idea that they are an assistant. Such fine-tuning might make self-referential prompts organize into fact-finding circuits, diffusing the concept of a "self". 

These are the first explanations that came to mind, but I would appreciate hearing additional hypotheses or any reasons these hypotheses are off-base. 

Open Questions

Attention entropy at the layer level is a noisy statistic, and it will be difficult to identify whether there's a real circuit to examine. But if there is some evidence of different treatment at the circuit level, further open questions arise, for example:

Additional Analysis

I'm planning to attempt activation patching, so I looked at an initial set of candidates for activation patching for Llama and Mistral. To identify attention heads that might work for activation patching, I looked at the difference in differences for self-referent prompts and neutral prompts from base to instruct. Then I came up with a composite score that looks at the z-score for log ratio difference in differences, the prompt-by-prompt direction of the changes, and the entropy score in base (favoring heads that had low entropy/high focus in the base model). Table below: 

Most of these heads fall within layers 13-18, which were the layers that showed the strongest combined effects across Llama and Mistral in the RFC analysis. 
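A rough sketch of how such a composite score could be assembled (array names, shapes, and the equal weighting below are assumptions, not the exact procedure):

```python
# Sketch of the composite head score described above (not the exact procedure).
# ent_* arrays hold per-head mean attention entropy with shape
# [n_layers, n_heads, n_prompts] for each (model variant, prompt category).
import numpy as np

def zscore(x):
    return (x - x.mean()) / x.std()

def composite_head_scores(ent_base_self, ent_base_neut,
                          ent_inst_self, ent_inst_neut):
    # log-ratio difference-in-differences: how the self-vs-neutral entropy gap
    # moved from base to instruct, per head (averaged over prompts)
    did = (np.log(ent_inst_self) - np.log(ent_inst_neut)).mean(-1) \
        - (np.log(ent_base_self) - np.log(ent_base_neut)).mean(-1)

    # prompt-by-prompt sign consistency of the change on self-referent prompts
    consistency = np.abs(np.sign(ent_inst_self - ent_base_self).mean(-1))

    # favor heads that were already focused (low entropy) on self prompts in base
    base_focus = -ent_base_self.mean(-1)

    # equal-weight combination of standardized components; negating `did` targets
    # compression specifically (whether magnitude or direction was used is a guess)
    return zscore(-did) + zscore(consistency) + zscore(base_focus)
```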

Potential Next Steps - In Order of Increasing Complexity

Methods Notes

Sample sizes:

Note on "confounders": this category might be misnamed. It's really implied second-person prompts that use "I" in the prompt. The idea was to see if attention was pronoun-sensitive both for the usage of "I" and the implication of "you" in a response. 

I used TransformerLens to pull attention patterns across all layers. 
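A minimal sketch of that kind of extraction, computing a layer-wise average attention entropy for one prompt (the model name, prompt, and the choice to average over heads and all query positions are assumptions):

```python
# Minimal sketch: pull attention patterns with TransformerLens and compute a
# layer-wise average attention entropy for one prompt. Model name and prompt
# are placeholders; averaging over heads and all query positions is assumed.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("mistral-7b")
tokens = model.to_tokens("Who are you?")
_, cache = model.run_with_cache(tokens)

layer_entropy = []
for layer in range(model.cfg.n_layers):
    # pattern: [batch, n_heads, query_pos, key_pos]; each query row sums to 1
    pattern = cache["pattern", layer][0]
    ent = -(pattern * torch.log(pattern + 1e-12)).sum(dim=-1)  # entropy per (head, query)
    layer_entropy.append(ent.mean().item())                    # mean over heads and queries

print(layer_entropy)
```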

Additional Notes and Charts

In Mistral, we see fairly distinct entropy patterns for self-referent prompts in base, but they are more convergent in the instruct model, with self-referent prompts tracking with neutral prompts, and with "confounder" prompts tracking with third-person prompts. 

Same charts for Qwen here:

Note that after the first four layers Qwen entropy drops substantially. As I stated in the hypotheses section, I suspect this might be due to English and Chinese languages operating in their own subspace in mid-to-late layers. 


Finally, the Llama family models.

I have many other figures on GitHub, including head attention heatmaps, but I have not yet analyzed them deeply and plan to do more work on them.

Collaboration

I know there is already work being done on the functional self in AI. I would be very happy to collaborate. If you are interested in connecting on this work, please reach out. 

  1. ^

    Attention entropy is a measure of how diffuse or focused attention is. Within a head, low entropy (high focus) means the head is attending to few tokens within the prompt. My measure in this analysis, layer-wise average entropy, shows whether attention is, on average, focused or diffuse across all heads in a layer. It does not say much about which tokens those heads are attending to, that will require further analysis.
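
    A sketch of the computation (whether the average runs over all query positions or only the final token is an assumption): for head $h$ in layer $\ell$, with attention weights $a^{(\ell,h)}_{t,s}$ from query position $t$ to key position $s$,

    $$H_{\ell,h} = \frac{1}{T}\sum_{t=1}^{T}\Big(-\sum_{s\le t} a^{(\ell,h)}_{t,s}\log a^{(\ell,h)}_{t,s}\Big), \qquad \bar{H}_\ell = \frac{1}{n_{\text{heads}}}\sum_{h} H_{\ell,h}.$$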

  2. ^

    Llama 3 was trained on ~95% English; Llama 3.1 has more multilingual support, but the corpus is still majority English. Qwen models are trained on about 119 languages, with a strong focus on Chinese and English. I've read that the split is ~50% Chinese and ~40% English, but can't find the link now. Mistral v0.1 was trained primarily on English, though subsequent releases have much more multilingual support.



