cs.AI updates on arXiv.org · November 12, 13:20
A Study of Language Models' Ability to Explain Their Own Computations

This paper studies whether language models (LMs) can learn to faithfully describe their internal computations, and whether they are better at explaining themselves than other models are. Using existing interpretability techniques as ground truth, LMs are fine-tuned to generate natural-language descriptions of the information encoded by LM features, the causal structure of internal activations, and the influence of specific input tokens on outputs. The results suggest that LMs can learn to reliably explain their internal computations, and that such explanations offer a scalable complement to existing interpretability methods.

arXiv:2511.08579v1 Announce Type: cross Abstract: Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models' privileged access to their own internals: using a model to explain its own computations generally works better than using a different model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.
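The abstract describes building training data from existing interpretability techniques and fine-tuning an explainer LM on it. Below is a minimal, illustrative sketch of one piece of that pipeline: producing a ground-truth "token influence" label with a simple gradient-times-input attribution and packaging it as a natural-language (query, explanation) pair. The model name, prompt, attribution method, and example format are assumptions for illustration, not the paper's actual setup.

```python
# Sketch: derive a ground-truth token-influence explanation from a simple
# attribution method, then format it as a (query, explanation) training pair
# for an explainer LM. All specifics here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper works with larger LMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_influence(prompt: str):
    """Gradient-x-input attribution of each prompt token on the top next-token logit."""
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits
    logits[0, -1].max().backward()          # top next-token logit
    scores = (embeds.grad * embeds).sum(-1).squeeze(0)  # one score per input token
    return list(zip(tok.convert_ids_to_tokens(ids[0].tolist()), scores.tolist()))

prompt = "The capital of France is"
attributions = token_influence(prompt)
most_influential = max(attributions, key=lambda t: abs(t[1]))[0]

# One training example for the explainer model (hypothetical format).
example = {
    "query": f"Which input token most influences the model's next prediction for: '{prompt}'?",
    "explanation": f"The token '{most_influential}' has the largest influence on the output.",
}
print(example)
```

Repeating this over many prompts (and over other ground-truth sources such as feature descriptions or activation interventions) would yield the tens of thousands of example explanations the abstract mentions for fine-tuning.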


Related tags

Language models · Self-explanation · Interpretability · Internal computations · Natural language descriptions