cs.AI updates on arXiv.org · November 12, 13:20
A Study of Language Models' Ability to Explain Their Own Computations

This paper studies whether language models (LMs) can learn to faithfully describe their internal computations, and whether they are better at explaining themselves than other models are. Using existing interpretability techniques as ground truth, LMs are fine-tuned to generate natural-language descriptions of the information encoded by LM features, the causal structure of internal activations, and the influence of specific input tokens on outputs. The results suggest that LMs can learn to reliably explain their internal computations, and that such explanations offer a scalable complement to existing interpretability methods.

arXiv:2511.08579v1 Announce Type: cross Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models' privileged access to their own internals: using a model to explain its own computations generally works better than using a different model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.
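The abstract describes building training data from existing interpretability techniques and fine-tuning an explainer LM on it. Below is a minimal, illustrative sketch of one piece of that pipeline: producing a ground-truth "token influence" label with a simple gradient-times-input attribution and packaging it as a natural-language (query, explanation) pair. The model name, prompt, attribution method, and example format are assumptions for illustration, not the paper's actual setup.

```python
# Sketch: derive a ground-truth token-influence explanation from a simple
# attribution method, then format it as a (query, explanation) training pair
# for an explainer LM. All specifics here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper works with larger LMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_influence(prompt: str):
    """Gradient-x-input attribution of each prompt token on the top next-token logit."""
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits
    logits[0, -1].max().backward()          # top next-token logit
    scores = (embeds.grad * embeds).sum(-1).squeeze(0)  # one score per input token
    return list(zip(tok.convert_ids_to_tokens(ids[0].tolist()), scores.tolist()))

prompt = "The capital of France is"
attributions = token_influence(prompt)
most_influential = max(attributions, key=lambda t: abs(t[1]))[0]

# One training example for the explainer model (hypothetical format).
example = {
    "query": f"Which input token most influences the model's next prediction for: '{prompt}'?",
    "explanation": f"The token '{most_influential}' has the largest influence on the output.",
}
print(example)
```

Repeating this over many prompts (and over other ground-truth sources such as feature descriptions or activation interventions) would yield the tens of thousands of example explanations the abstract mentions for fine-tuning.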


Related tags

Language models · Self-explanation · Interpretability · Internal computations · Natural language descriptions