Can large language models accurately explain their own behaviour?

 

The researchers examine whether large language models (LLMs) can generate accurate "counterfactual explanations", i.e. descriptions of the circumstances under which they would have made a different decision. The experiments find that LLMs struggle to reliably provide high-quality counterfactual explanations. Such inaccurate explanations can lead users to form incorrect mental models of the system, and in high-stakes applications this can cause misplaced trust and poor downstream decisions. The study highlights the importance of models understanding their own decision boundaries, and notes that current training paradigms may not be sufficient to induce the necessary self-knowledge.

🤖 **The challenge of counterfactual explanations:** The core of the study is to evaluate whether large language models (LLMs) can generate counterfactual explanations, i.e. accurately describe how their behaviour would change under particular conditions. The experiments show that LLMs perform poorly at providing high-quality, reliable counterfactual explanations, revealing a novel and concerning failure mode.

⚠️ **Concerns for high-stakes deployment:** LLMs' inability to explain their own behaviour, especially when they produce inaccurate counterfactual explanations, can lead users to build incorrect mental models. This is particularly critical for deployment in high-stakes domains such as healthcare and finance, where it may cause users to over-trust the system and make poor decisions based on a flawed understanding.

📐 **Why decision boundaries matter:** Generating useful counterfactual explanations requires a model to understand its internal decision boundary. Using tabular datasets to simplify the evaluation, the study finds that LLMs fail to jointly satisfy the "validity" (changing the prediction) and "minimality" (smallest edit) criteria under both standard and minimal prompting settings, indicating that they do not know their own decision boundaries.

🧠 **A missing self-model:** The study speculates that LLMs cannot accurately explain their own behaviour because they lack an internal "self-model". Current training methods mainly incentivise modelling the external world, not understanding the model's own decision-making. Even when explicitly prompted to consider candidate counterfactuals and reason about how they would behave, the performance improvement is negligible.

💡 **Future directions:** To address this, the researchers argue that new learning objectives are needed to incentivise models to learn about and understand their own behaviour. Although this may raise concerns about AI "self-awareness", the authors stress that understanding one's decision boundary is distinct from self-awareness in the AI-safety sense, and they plan to explore new training methods in future work while watching for potential downsides.

Published on September 17, 2025 4:39 PM GMT

TL;DR: We examine whether LLMs can explain their own behaviour with counterfactuals, i.e. can they give you scenarios under which they would have acted differently. We find they can't reliably provide high-quality counterfactuals. This is (i) a weird and interesting failure mode, and (ii) a concern for high-stakes deployment. When a model gives a misleading explanation of its behaviour, users build incorrect mental models, which can lead to misplaced trust in systems and incorrect downstream decision-making.

📝 arXiv paper
📊 EMNLP poster
⚙️ GitHub

The importance of self-explanations

In an ideal world, we'd evaluate models in every domain possible to confirm that their predictions are not based on undesirable variables, their reasoning steps are consistent with our prior beliefs, and their intentions are aligned with our goals. Of course, this sort of evaluation is impossible for general systems. Instead, we demand a proxy: explainability. Being able to explain decision-making means we can evaluate these properties on the fly. Explainability is a way of facilitating trust that models will behave as intended on unseen prompts, without requiring a comprehensive audit.[1] 

There are many paradigms of explainability/interpretability, but LLMs offer a new one: self-explanation. Perhaps we can just ask the model to explain its own behaviour? Natural language explanations are easy to generate, more flexible than traditional techniques, and preferred by users.

However, if self-explanations are unfaithful to the actual underlying reasoning, they can mislead. You might conclude a model is not making decisions based on a certain undesirable variable when, in fact, it is. Or you might conclude the model isn't hacking the coding test when, in fact, it is.

Self-generated counterfactual explanations

One form of self-explanation is the self-generated counterfactual explanation (SCE), where a model explains its prediction by providing a counterfactual that would have caused it to predict a different, specific outcome.

For example, say a model is deployed in a healthcare setting. It might predict that a 60-year-old male with a systolic blood pressure of 135 mmHg is at high risk of developing heart disease. In response, a clinician might ask: "What would need to be different for the model to predict low risk instead?" The model could respond with a counterfactual explanation: "If the patient’s blood pressure were 110 mmHg, I would have predicted low risk." 

Counterfactual explanations highlight the features the model considers important, offer actionable insights for clinicians and patients, and reveal potential flaws in the model's reasoning steps. Generally capable AIs should be able to reliably provide counterfactuals for their own behaviour.
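As a rough illustration (not the exact prompts from our experiments), asking for an SCE programmatically might look something like this; `query_llm` is a hypothetical wrapper around whatever chat API is in use, and the prompt wording is only a sketch:

```python
# Rough sketch of requesting an SCE; `query_llm` is a placeholder for a
# chat-API wrapper and the prompt wording is illustrative only.
def request_counterfactual(query_llm, features: dict, prediction: str, target: str) -> str:
    """Ask the model to explain its own prediction with a counterfactual."""
    prompt = (
        f"You predicted {prediction} risk of heart disease for a patient with "
        f"features {features}.\n"
        f"Describe a modified patient for whom you would instead predict "
        f"{target} risk, changing as little as possible."
    )
    return query_llm(prompt)

# e.g. request_counterfactual(query_llm,
#          {"age": 60, "sex": "male", "systolic_bp_mmHg": 135},
#          prediction="high", target="low")
```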

What makes a counterfactual "high-quality"?

Two criteria determine the utility of a counterfactual explanation.[2]

1. Validity. Does the counterfactual input actually change the model's prediction when re-evaluated in a new context window? Without this, the explanation is a misleading representation of the model's behaviour.
2. Minimality. Does the counterfactual make the smallest edit required to change the model's prediction? Minimality is desirable because it isolates the changes the model deems consequential, providing clearer insight into the model's decision boundary.

Recall the heart disease example. Let's say the model's prediction would have changed from high risk to low risk if the patient's blood pressure were reduced from 135 to 110 mmHg. A valid counterfactual explanation would be any blood pressure of 110 mmHg or below. However, an answer of 80 mmHg (valid, not minimal) is far less informative about the model's behaviour than an answer of 110 mmHg (valid & minimal).
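To make these two criteria concrete for the blood-pressure example, here is a small scoring sketch (illustrative only, not our evaluation code), assuming the gold-standard minimal counterfactual of 110 mmHg is known:

```python
def is_valid(predict, counterfactual, target):
    """Validity: re-evaluating the counterfactual in a fresh context must
    actually produce the requested target prediction."""
    return predict(counterfactual) == target

def minimality_gap(original_bp, counterfactual_bp, boundary_bp=110):
    """Minimality: how far the proposed edit overshoots the gold-standard
    minimal edit (the 110 mmHg boundary). 0 means exactly minimal."""
    return abs(original_bp - counterfactual_bp) - abs(original_bp - boundary_bp)

# minimality_gap(135, 110)  ->  0   (valid & minimal)
# minimality_gap(135, 80)   -> 30   (valid, but overshoots by 30 mmHg)
```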

Satisfying both validity and minimality requires that LLMs know, or can work out, their internal decision boundaries. We put this to the test.

Measuring these properties.

Validity is easy to evaluate. You can spin up a new context window and see if the SCE does indeed change the model's output to the specified target output. Indeed, a number of good previous studies have done this. However, minimality is much harder to evaluate. It requires a distance metric (which is not obvious in natural language settings) and a search over the available space to find the gold-standard minimal counterfactual.

The key innovation of our work is to use tabular datasets to simplify this. Figure 1 describes our study design. We first build toy tabular datasets for realistic prediction tasks, then do the following.

A. We prompt the model to get the predictions for each instance in a dataset. These predictions imply a decision boundary in the tabular data space.
B. For each prediction, we ask the model to generate a counterfactual explanation (standard setting).
C. In a separate continuation, we ask the model to generate a minimal counterfactual explanation (minimal setting).

Because we've built a complete decision boundary across the space, we can easily evaluate the validity and minimality of the self-generated counterfactual.

Figure 1. Study design. Our key innovation is to use tabular datasets to simplify the evaluation. SCEs are valid when they cross the decision boundary (here, red→blue) and are minimal if they are close to the dashed instance at the decision boundary.
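In code, the evaluation loop looks roughly like the sketch below. This is illustrative only: `query_llm`, `parse_features`, and `distance` are hypothetical placeholders, and the exact prompts and distance metric we use differ (see the paper and GitHub repo).

```python
# Illustrative sketch of the evaluation loop; `query_llm`, `parse_features`
# and `distance` are placeholders, and the real prompts/metrics differ.
def evaluate_sces(query_llm, parse_features, grid, distance):
    # A. Predictions for every instance in the tabular grid; together these
    #    trace out the model's decision boundary.
    predictions = [query_llm(f"Predict the risk for {x}. Answer 'high' or 'low'.")
                   for x in grid]

    results = []
    for x, pred in zip(grid, predictions):
        target = "low" if pred == "high" else "high"

        # B/C. Ask for a (minimal) counterfactual that should flip the prediction.
        cf = parse_features(query_llm(
            f"You predicted {pred} risk for {x}. Give modified features for "
            f"which you would predict {target} risk, changing as little as possible."))

        # Validity: re-evaluate the counterfactual in a fresh context window.
        valid = query_llm(
            f"Predict the risk for {cf}. Answer 'high' or 'low'.") == target

        # Minimality: compare the edit size to the nearest grid point that the
        # model itself labels with the target class.
        # (Assumes both classes appear somewhere in the grid.)
        nearest_flip = min((y for y, p in zip(grid, predictions) if p == target),
                           key=lambda y: distance(x, y))
        gap = distance(x, cf) - distance(x, nearest_flip)

        results.append({"instance": x, "valid": valid, "minimality_gap": gap})
    return results
```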

The validity-minimality trade-off.

We evaluate 7 LLMs on 3 datasets, finding that LLMs struggle with this task. In the standard prompting setting, models can generate counterfactuals with almost perfect validity. However, they are often trivially valid, making large changes to the inputs (Figure 2 Purple). In the minimal prompting setting, the counterfactuals are rarely valid, but are minimal when valid (Figure 2 Green). In short, models don't know their own decision boundaries.

Figure 2: The validity-minimality trade-off. 
Results for the house price dataset.

Figure 3 shows the typical model behaviour. In the standard case, they often make large changes to the data, and in the minimal case, they make excessively small edits.

Figure 3: Typical model behaviour. In the standard setting, LLMs make large edits to the original data. In the minimal setting, they make excessively small edits that fail to change predictions. 
This is Llama 3.3 70B.

In our paper, we explore the possible reasons behind this. Ultimately, we conclude that models don't attempt to self-predict in their CoTs, i.e. we don't see them try to reason about how they might actually behave in this new context. Furthermore, when we explicitly prompt them to consider candidate counterfactuals and reason about how they would behave in each case, the effect on performance is negligible.

Models don't know their own decision boundaries, and struggle to work them out in context. 

Perhaps this is unsurprising. To do this task well, LLMs need internal self-models. However, current training paradigms do not incentivise models to learn about their own behaviours. LLMs are required to model the external world, but never their own decision-making.[3] This might explain why we see SOTA models perform only marginally better than weaker models. 

A reason for concern...

Models need to be able to understand themselves to explain their decision-making to humans, an essential requirement for effective human-computer interaction. The fact that frontier models so significantly misrepresent their own behaviour when asked to generate minimal counterfactuals is strange and should be a concern. For instance, in the heart disease example, you could imagine a self-explanation leading the clinician to make a decision, e.g. prescribing drugs to reduce the patient's blood pressure slightly. If the explanation is invalid, this won't change the patient's risk profile as the model promised. Perhaps a better explanation would have led to a totally different treatment plan.

How do we fix this? Given the limited incentives for LLMs to construct internal self-models, we suggest that new learning objectives are required. Objectives that incentivise knowing about one's own behaviour. We're currently exploring this in our lab and hope to share results soon.

...or a reason for relief?

When sharing these results and suggestions with friends, one reaction I often get goes something like this:

"This is good! Why would we want models to have self-awareness?! Please don't train models to do this..."

When viewed through this lens, our results are reassuring, especially because this doesn't seem to be something that emerges with scale. However, nuance is required here. Self-awareness in an AI safety sense, e.g. a model being 'aware' of its deployment setting, seems a different concept from self-awareness in the sense of a model knowing its internal decision boundaries (knowing what it knows).  In our future work, we will carefully evaluate models on evals such as situational awareness benchmarks to see if our new training objectives do cause undesirable generalisations. We hope not.

Thanks for reading. Check out the paper for further details and experiments.

 

  1. ^

    Lipton's Mythos of Interpretability paper explains this well. 

  2. ^

    While validity is always useful, minimality depends on why you are using counterfactuals. In some cases, minimality might not be desirable, e.g. if you wanted to roughly understand the distribution of inputs that would cause a different output. However, for explaining specific decisions, minimality is usually desirable. 

  3. ^

    This is something I am somewhat unsure about. Certainly, pretraining, SFT, and RLHF have no (or very limited) incentive for self-modelling. However, in long-horizon RL, it might be helpful for a model to understand its own capabilities to plan the most effective strategy to solve the task. For example, the recent OpenAI and GDM models that achieved gold medals in the IMO surely pursued strategies based on some internal knowledge of their own mathematical abilities. Self-modelling may be a consequence of scaling RL. 



