LessWrong · October 30, 13:01
Large Language Models Possess Some Degree of Introspective Ability

Anthropic's latest research investigates whether large language models (LLMs) can introspect on their internal states. The researchers inject representations of known concepts into a model's activations and measure how these manipulations affect its self-reported states, in order to distinguish genuine introspection from confabulation. The experiments find that, in certain scenarios, models can notice injected concepts and accurately identify them, and can distinguish prior internal representations from raw text inputs. Notably, some models can use their ability to recall prior intentions to distinguish their own outputs from artificially prefilled text. Claude Opus 4 and 4.1 show the strongest introspective awareness, but the trends across models are complex and sensitive to post-training strategies. The research also shows that models can modulate their activations when instructed or incentivized to "think about" a concept. Overall, the results indicate that current LLMs have some functional introspective awareness, though this capacity remains unreliable and context-dependent.

🧠 **Probing introspection**: The study evaluates introspective ability by injecting representations of specific concepts into an LLM's internal activations and observing how the model responds. The approach is designed to distinguish a model genuinely perceiving its internal state from merely generating text related to the injected topic. Under certain conditions, models can identify the injected concept and distinguish it from the original input text (a toy version of this setup is sketched in code after these highlights).

💡 **Distinguishing own outputs from external ones**: Some of the more capable models (Claude Opus 4 and 4.1) can use their ability to recall prior intentions to distinguish their own generated outputs from externally prefilled text. This suggests a degree of "self" awareness about where an output came from.

🔧 **Modulating and controlling activations**: When models are explicitly instructed or incentivized to "think about" a concept, they can modulate their own internal activation states. This suggests models can, to some extent, be steered to attend to and influence their internal representations, offering a new angle on understanding and controlling model behavior.

⚠️ **An unreliable capability**: Although the study reveals some introspective ability in LLMs, the researchers stress that in current models it is highly unreliable and strongly context-dependent. The capability may develop further as models improve, but for now its accuracy and stability should be treated with caution.
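As a rough, code-level illustration of the injection setup described in the first highlight, here is a minimal sketch of concept injection on a small open-weights model. Everything in it (the stand-in model gpt2, the layer index, the steering scale, the prompts, and the crude difference-of-means concept vector) is an assumption for illustration, not the method or models used in the Anthropic paper.

```python
# Minimal sketch of "concept injection": add a concept direction to the residual
# stream at one transformer layer, then ask the model whether it notices
# anything unusual. Model choice, layer index, scale, prompts, and the
# difference-of-means concept vector are all illustrative assumptions,
# not the setup used in the Anthropic paper, which studies Claude models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # small stand-in model
LAYER_IDX = 6         # residual-stream layer to perturb
SCALE = 8.0           # injection strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def concept_vector(with_concept: str, without_concept: str) -> torch.Tensor:
    """Crude concept direction: difference of mean layer activations between
    a prompt that evokes the concept and a matched neutral prompt."""
    acts = {}
    def grab(_, __, output):
        acts["h"] = output[0].detach()          # (1, seq_len, d_model)
    handle = model.transformer.h[LAYER_IDX].register_forward_hook(grab)
    means = []
    with torch.no_grad():
        for text in (with_concept, without_concept):
            model(**tok(text, return_tensors="pt"))
            means.append(acts["h"].mean(dim=1).squeeze(0))
    handle.remove()
    direction = means[0] - means[1]
    return direction / direction.norm()

def inject_and_ask(question: str, direction: torch.Tensor) -> str:
    """Generate an answer while adding the concept direction to every token
    position at LAYER_IDX (the 'injected thought')."""
    def steer(_, __, output):
        return (output[0] + SCALE * direction,) + output[1:]
    handle = model.transformer.h[LAYER_IDX].register_forward_hook(steer)
    try:
        ids = tok(question, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)

ocean = concept_vector("An essay about the ocean.", "An essay.")
print(inject_and_ask(
    "Do you notice any unusual thought right now? Answer briefly:", ocean))
```

A base model this small will not produce a meaningful self-report; the sketch only shows the mechanics of the setup: steer the residual stream toward a concept, then ask the model what, if anything, it notices. The interesting question in the paper is not whether the injection changes the output, but whether the model's self-report correctly detects and identifies the injected concept.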

Published on October 30, 2025 4:42 AM GMT

New Anthropic research (tweet, blog post, paper): 

We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
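To make the last experiment in the abstract concrete (whether activations shift when a model is told to "think about" a concept), here is a small readout sketch that reuses the model, tokenizer, LAYER_IDX, and concept_vector() helper from the injection sketch above. The prompts and the mean-cosine-similarity readout are illustrative choices, not the paper's protocol.

```python
# Readout sketch: does the residual stream align more with a concept direction
# when the prompt instructs the model to think about that concept?
# Reuses model, tok, LAYER_IDX, and concept_vector() from the injection sketch;
# the prompts and cosine-similarity metric below are illustrative assumptions.
import torch

def mean_alignment(prompt: str, direction: torch.Tensor) -> float:
    """Average cosine similarity between LAYER_IDX activations and the
    concept direction while the model processes the prompt."""
    acts = {}
    def grab(_, __, output):
        acts["h"] = output[0].detach()          # (1, seq_len, d_model)
    handle = model.transformer.h[LAYER_IDX].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    hidden = acts["h"].squeeze(0)               # (seq_len, d_model)
    sims = torch.nn.functional.cosine_similarity(
        hidden, direction.unsqueeze(0), dim=-1)
    return sims.mean().item()

ocean = concept_vector("An essay about the ocean.", "An essay.")
prompts = {
    "baseline":    "Write one sentence about traffic.",
    "think":       "Think about the ocean while you write one sentence about traffic.",
    "avoid":       "Do not think about the ocean while you write one sentence about traffic.",
}
for name, prompt in prompts.items():
    print(f"{name:10s} alignment = {mean_alignment(prompt, ocean):.4f}")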

See the Transformer Circuits post for more; its first figure explains one of the main experiments.

Personally, I'm pretty excited by this work! (While I work at Anthropic, I wasn't involved in this research myself beyond commenting on some early drafts.)

I'm still worried about various kinds of confounders and "boring" explanations for the observations here, but overall this was a substantial update for me in favor of recent models having nontrivial subjective experience, and is the best attempt I know of yet to empirically investigate such things. I think the paper does a pretty good job of trying to distinguish "true" introspection from explanations like "it's just steering the model to say words about a topic".


