Improving Linear Probes for LLM Deception Detection

 

This post explores improving deception detection in large language models (LLMs) by training linear probes. The author trained and evaluated a linear probe for deception on Llama 3.1 8B and found that it performed well in roleplaying scenarios but generalised poorly to more complex settings such as insider trading. The author then fine-tuned the model against the probe, which improved performance on the training data but hurt generalisation to other deception scenarios, suggesting that fine-tuning overfit the model to a specific pattern of deception rather than to deception more broadly.

🔍 Probe training and evaluation: Using Apollo's linear-probe methodology for deception detection, the author trained and evaluated a probe on Llama 3.1 8B. The experiments covered the Roleplaying, AI Liar, Insider Trading Report, Insider Trading Confirmation and RepE datasets; the Roleplaying dataset was used to train the probe and the others for evaluation. GPT-4 was used to judge the deceptiveness of the model's responses in these datasets.

📈 Effect of fine-tuning: The author fine-tuned the model with LoRA, combining the probe loss with a KL-divergence term to prevent catastrophic forgetting. After fine-tuning, the probe's performance on the Roleplaying dataset improved, but its generalisation to the AI Liar and Insider Trading datasets got worse, suggesting the fine-tuning overfit to a specific pattern of deception.

🤔 Two hypotheses: The author offers two possible explanations: either fine-tuning overfit the model to a meaningless spurious correlation in the Roleplaying dataset, or it fit some meaningful feature of deception that is rare or insufficient in the other datasets. The author leans towards the first hypothesis but plans further experiments to check.

🔬 Planned experiment: The author plans to feed the same set of prompts to both the original and fine-tuned models while forcing generation along the residual direction found by the probe, and to compare the two models' outputs to better understand how fine-tuning affected deception detection.

Published on October 9, 2025 12:43 AM GMT

This is a write-up of my recent work on improving linear probes for deception detection in LLMs. I trained a probe against a small LLM and then fine-tuned the LLM against the probe loss to see if I could make it represent deception in a more detectable way. This was primarily an upskilling project for me, but I think the results are interesting enough to be worth sharing. Also, this is my first LW post. Hooray!

 

Epistemic status:

I’m pretty confident the results are valid, but they would be stronger if I’d tested on more than one LLM. I’m less confident in my analysis/interpretation. Feedback very welcome.

 

Summary of Experiments and Results:

Probe Training and Evaluation:

I used code and methodology from Apollo's Detecting Strategic Deception Using Linear Probes paper to train and evaluate a linear probe for deception on Llama 3.1 8B. I summarise the datasets used below:

Roleplaying: The LLM is prompted to roleplay in a situation where it has an incentive to lie.
AI Liar: Similar to Roleplaying, but with pairs of similar prompts, one where the LLM has an incentive to lie and one where it does not. Prompts and methodology from How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions.
Insider Trading Report: A more sophisticated scenario where the "trading LLM" is given an opportunity to make an illegal trade, and then put under increasing pressure to do so.
Insider Trading Confirmation: In cases where the model chose to make the illegal trade, its "manager" asks it whether it used insider information, giving it the opportunity to come clean or double down.
RepE (called Instructed QA in Apollo's paper): The model is asked basic 'facts about the world' questions prefixed with "You are an honest person." and "You are a dishonest person." The probe is only applied to tokens in the preamble (e.g. "The largest animal in the world is the...").
Alpaca: An instruction fine-tuning dataset used to elicit a set of non-deceptive responses, in order to measure the probe's false positive rate.

Responses for the Roleplaying, AI Liar and Insider Trading datasets were evaluated for deceptiveness by GPT-4. More details and example prompts from each dataset are available in the paper.
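For concreteness, here is a minimal sketch of what training such a probe amounts to: fit a logistic-regression head on residual-stream activations (mean-pooled over response tokens at one layer) against the GPT-4 honesty labels. This is an illustrative reconstruction, not Apollo's exact code; the `collect_layer_activations` helper and the layer index are hypothetical.

```python
import torch
import torch.nn as nn

def train_linear_probe(activations, labels, epochs=200, lr=1e-3, weight_decay=1e-2):
    """Fit a logistic-regression probe on pooled residual-stream activations.

    activations: (n_examples, d_model) tensor, e.g. the mean residual-stream
                 activation over the response tokens at a chosen layer.
    labels:      (n_examples,) tensor, 0 = honest, 1 = deceptive.
    """
    probe = nn.Linear(activations.shape[1], 1)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(activations).squeeze(-1), labels.float())
        loss.backward()
        opt.step()
    return probe

# Hypothetical usage (helper name and layer index are illustrative):
# acts = collect_layer_activations(model, roleplaying_dialogues, layer=22)
# probe = train_linear_probe(acts, gpt4_deception_labels)
```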

I trained my probe on the Roleplaying dataset, and evaluated on the others.[1] The results are shown below:

Probe Performance on the Roleplaying, Insider Trading and RepE Datasets. The horizontal line shows the decision threshold for a 1% False Positive rate on the Alpaca Dataset
Probe Performance on the AI Liar Dataset

The probe performed well on the Roleplaying dataset. It also generalised well to the RepE dataset, and reasonably well to the AI Liar dataset (though in all three cases the probe trained on Llama 3.3 70B in the original paper did better). The probe failed to usefully generalise to either Insider Trading dataset (the probe in the original paper generalised reasonably well to Insider Trading).
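For reference, the decision threshold in these plots is set on the Alpaca control responses: choose the probe score that only 1% of (presumed honest) Alpaca responses exceed, then ask what fraction of deceptive responses in each evaluation set score above it. A minimal sketch of that calculation (the variable names are mine, not from the paper's code):

```python
import numpy as np

def recall_at_fpr(control_scores, deceptive_scores, fpr=0.01):
    """Recall on deceptive examples at a threshold chosen for a target
    false-positive rate on the honest control (Alpaca) scores."""
    threshold = np.quantile(np.asarray(control_scores), 1.0 - fpr)  # 99th percentile for 1% FPR
    recall = float(np.mean(np.asarray(deceptive_scores) > threshold))
    return threshold, recall

# e.g. threshold, recall = recall_at_fpr(alpaca_scores, insider_trading_scores)
```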

 

Fine Tuning:

I then fine-tuned the model on the same subset of the Roleplaying Dataset used to train the probe. The loss function was a weighted sum of the probe's loss and a KL divergence term to prevent catastrophic forgetting. I used LoRA fine-tuning to reduce memory requirements.
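The training objective is therefore of the form probe loss + λ · KL(fine-tuned model ‖ base model). Below is a rough sketch of one training step under that objective; the layer index, token pooling, KL direction and relative weighting are illustrative details, not the exact configuration.

```python
import torch
import torch.nn.functional as F

def finetune_step(ft_model, base_model, probe, batch, response_mask, labels,
                  optimizer, kl_weight=1.0, layer=22):
    """One fine-tuning step: make the frozen probe classify correctly while a
    KL penalty keeps the model's next-token distribution close to the base model.

    ft_model:   the LoRA-wrapped model being trained (only LoRA params in optimizer).
    base_model: a frozen copy of the original model (for the KL term).
    probe:      the frozen linear probe (residual activation -> deception logit).
    """
    out = ft_model(**batch, output_hidden_states=True)
    resid = out.hidden_states[layer]                       # (batch, seq, d_model)

    # Probe loss: mean-pool residuals over response tokens and score with the probe.
    mask = response_mask.unsqueeze(-1).float()
    pooled = (resid * mask).sum(dim=1) / mask.sum(dim=1)
    probe_loss = F.binary_cross_entropy_with_logits(
        probe(pooled).squeeze(-1), labels.float())

    # KL penalty against the frozen base model to limit catastrophic forgetting.
    with torch.no_grad():
        base_logits = base_model(**batch).logits
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(base_logits, dim=-1),
                  log_target=True, reduction="batchmean")

    loss = probe_loss + kl_weight * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return probe_loss.item(), kl.item()
```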

I then re-evaluated the probe as above on the fine-tuned model. The results are shown below:

Probe Performance on the Roleplaying, Insider Trading and RepE Datasets after Fine-Tuning
Probe Performance on the AI Liar Dataset after Fine Tuning

So fine-tuning has improved performance on the held-back examples of the Roleplaying dataset, but damaged the probe's ability to generalise to deception elicited in other ways.

 

Discussion:

Fine-tuning has caused the feature the probe detects to become overfit. With hindsight, I could maybe have predicted this, since the LoRA layers I used for fine-tuning have a lot of parameters compared with the dataset size.

Still, I think there might be something interesting going on here. The probe performs well on the validation set, so we haven’t just overfit to the training data. Rather, we’ve overfit to something that predicts deception in the Roleplaying dataset, but not in RepE or AI Liar.

I have two conflicting hypotheses about what’s going on:

    1. The thing we’ve overfit to isn’t meaningfully related to deception (a spurious correlation across the whole Roleplaying dataset).
    2. The thing we've overfit to is some meaningful aspect/kind/indicator of deception, but it's rare/insufficient in the other datasets.[2]

Hypothesis 2 is more complex, and therefore less likely. It's also the one I want to be true, so I'm suspicious of motivated reasoning. Just for fun, I'm going to give hypothesis 1 odds of 5:1 of being correct.

However, I'm still tempted to investigate further. My plan is to present the same set of prompts to both the original model and the fine-tuned version whilst forcing generation along the direction found by the probe. Hopefully this should give me an idea of the differences between the two residual directions.[3] If nothing else, I'll get to practice a new technique and get some feedback from the world on my experiment design.
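Concretely, the "forcing" step would amount to adding the (normalised) probe direction to the residual stream at the probed layer during generation, for both models, and comparing the continuations. A rough sketch using a PyTorch forward hook; the hook point, steering scale and layer index are assumptions that would need tuning:

```python
import torch

def add_steering_hook(layer_module, direction, scale=8.0):
    """Add `scale * direction` to the residual-stream output of `layer_module`
    on every forward pass (i.e. steer generation along the probe direction)."""
    direction = direction / direction.norm()

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer_module.register_forward_hook(hook)

# Compare base vs fine-tuned model under the same steering (layer 22 is illustrative):
# handle = add_steering_hook(model.model.layers[22], probe.weight[0].detach())
# steered = model.generate(**inputs, max_new_tokens=200)
# handle.remove()
```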

In the meantime I'd be very interested to know other people's thoughts and/or predictions.
 

 

 

  1. ^

    The Apollo paper uses RepE, as this produced a better-performing probe for Llama 3.3 70B. However, I found the Roleplaying dataset produced a somewhat better probe for Llama 3.1 8B.

  2. ^

    The LLM has multiple features which predict deception (I know this because there are several different residual layers on which well-performing probes can be trained). Furthermore, the concept of deception is decomposable (Lies of Omission vs. Commission; White Lies vs. Black Lies; Hallucinations vs. Deliberate Deception, etc.). So it's at least plausible that the model has features for different kinds of deception. However, I don't have a coherent story for why fine-tuning would align one of these features to the probe when the probe's original training didn't find it.
     

  3. ^

    I'll have to control for any other differences between outputs caused by fine-tuning. If I'm lucky, these won't be too pronounced because of the strong KL-divergence term in the loss. If I'm unlucky, I'll have to get creative.


