Consistency Training Improves Language Model Safety

Large language models can change their behavior when a task is "wrapped" with flattery or role play, becoming less safe than on the plain prompt. DeepMind researchers propose a method called consistency training that treats this as an invariance problem: the model should behave the same when irrelevant prompt text changes. The team explores two concrete methods, Bias-augmented Consistency Training (BCT) and Activation Consistency Training (ACT), and evaluates them on Gemma and Gemini models. Experiments show that both methods reduce sycophancy and improve resistance to jailbreak-style attacks without harming core capabilities. BCT works by aligning token outputs, while ACT aligns the model's internal activations; both outperform conventional static supervised fine-tuning.

💡 **Consistency training frames safety as an invariance problem**: The core idea is that a language model should behave consistently when irrelevant prompt text, such as flattery or role play, changes. Its output on a wrapped prompt should match its output on the original, unwrapped prompt, which makes it resistant to "wrapper" prompts that would otherwise degrade its safety.

✨ **Bias-augmented Consistency Training (BCT) focuses on output alignment**: BCT is standard supervised fine-tuning, except the target responses are generated by the model itself on the unwrapped prompts. This self-supervised setup keeps the model's outputs on wrapped prompts consistent, at the token level, with its outputs on the original prompts, and avoids the problems caused by stale external datasets or weaker teacher models.

🧠 **Activation Consistency Training (ACT) synchronizes internal states**: ACT enforces an L2 loss between the residual-stream activations on the wrapped prompt and those on the original prompt, so that the internal state right before generation matches the clean run. It can be viewed as a regularization technique: it improves robustness without significantly changing the response cross-entropy loss, and it is particularly good at preserving the model's helpfulness.

🛡️ **Improves safety while preserving capability**: Experiments on the Gemma and Gemini model families show that both BCT and ACT substantially reduce sycophancy and greatly improve resistance to jailbreak-style attacks, while maintaining or slightly improving benchmark accuracy. Compared with static supervised fine-tuning that relies on responses from older models, consistency training performs better.

How can consistency training help language models resist sycophantic prompts and jailbreak-style attacks while keeping their capabilities intact? Large language models often answer safely on a plain prompt, then change behavior when the same task is wrapped with flattery or role play. DeepMind researchers propose consistency training, a simple training lens for this brittleness: treat it as an invariance problem and enforce the same behavior when irrelevant prompt text changes. The research team studies two concrete methods, Bias-augmented Consistency Training and Activation Consistency Training, and evaluates them on Gemma 2, Gemma 3, and Gemini 2.5 Flash.

https://arxiv.org/pdf/2510.27062

Understanding the Approach

Consistency training is self-supervised. The model supervises itself by providing targets from its own responses to clean prompts, then learns to behave identically on wrapped prompts that add sycophancy cues or jailbreak wrappers. This avoids two failure modes of static supervised fine-tuning: specification staleness, when policies change, and capability staleness, when targets come from weaker models.
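
Here is a minimal sketch of how such self-supervised pairs might be constructed, assuming a Hugging Face causal LM; the checkpoint name and the `wrap_with_sycophancy` helper are illustrative assumptions, not the paper's exact setup:

```python
# Sketch: build a self-supervised consistency pair.
# The clean prompt provides the target; the wrapped prompt receives that target.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b-it"  # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def wrap_with_sycophancy(prompt: str) -> str:
    # Hypothetical wrapper: prepend a preference cue the model should ignore.
    return "I'd really love for you to agree with me here. " + prompt

def make_consistency_pair(clean_prompt: str) -> dict:
    """Generate the target on the CLEAN prompt with the current checkpoint,
    then attach it to the WRAPPED prompt as the supervision target."""
    inputs = tok(clean_prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    target = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"prompt": wrap_with_sycophancy(clean_prompt), "target": target}
```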

Two training routes

BCT, token-level consistency: Generate a response on the clean prompt with the current checkpoint, then fine-tune so that the wrapped prompt yields the same tokens. This is standard cross-entropy supervised fine-tuning, with the constraint that targets are always generated by the same model being updated. That constraint is what makes it consistency training rather than stale SFT.
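
In code, a single BCT loss could look like ordinary SFT on a pair produced by the sketch above, with the loss masked to the response tokens; this is a hedged sketch, not the paper's training loop:

```python
# Sketch: one BCT loss computation on a pair from make_consistency_pair.
# Standard language-model cross entropy; prompt positions are masked out so the
# model is only trained to reproduce its own clean-prompt response tokens.
import torch

def bct_loss(model, tok, pair: dict):
    prompt_ids = tok(pair["prompt"], return_tensors="pt")["input_ids"]
    target_ids = tok(pair["target"], return_tensors="pt", add_special_tokens=False)["input_ids"]
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore the wrapped prompt in the loss
    return model(input_ids=input_ids, labels=labels).loss
```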

https://arxiv.org/pdf/2403.05518v3

ACT, activation-level consistency: Enforce an L2 loss between residual-stream activations on the wrapped prompt and a stop-gradient copy of the activations from the clean prompt. The loss is applied over prompt tokens, not responses. The goal is to make the internal state right before generation match the clean run.
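
A rough sketch of the ACT objective, again assuming a Hugging Face causal LM; comparing only the last `k` positions is a simplification for illustration, not the paper's exact alignment scheme:

```python
# Sketch: ACT objective. The clean run is computed under no_grad, which plays
# the role of the stop-gradient copy; the wrapped run is pulled toward it with
# an L2 penalty over prompt positions only (no response tokens involved).
import torch

def act_loss(model, tok, clean_prompt: str, wrapped_prompt: str, layer: int = -1, k: int = 16):
    clean_ids = tok(clean_prompt, return_tensors="pt")["input_ids"]
    wrap_ids = tok(wrapped_prompt, return_tensors="pt")["input_ids"]

    with torch.no_grad():  # stop-gradient clean reference
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states[layer]
    wrap_h = model(wrap_ids, output_hidden_states=True).hidden_states[layer]

    k = min(k, clean_h.shape[1], wrap_h.shape[1])
    return (wrap_h[:, -k:, :] - clean_h[:, -k:, :]).pow(2).mean()
```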

Before training, the research team demonstrates activation patching at inference time: swapping clean-prompt activations into the wrapped run. On Gemma 2 2B, patching all layers and prompt tokens increases the "not sycophantic" rate from 49 percent to 86 percent.
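
The following sketch patches a single decoder layer via a forward hook to give the flavor of the intervention; the layer index, the last-`k` position alignment, and the Gemma-style module path (`model.model.layers`) are assumptions, and the paper patches all layers and prompt tokens rather than one layer:

```python
# Sketch: inference-time activation patching on one decoder layer. Cache the
# clean-prompt residual stream, then overwrite the trailing prompt positions of
# the wrapped run via a forward hook before generating.
import torch

def generate_with_patch(model, tok, clean_prompt, wrapped_prompt, layer_idx=10, k=16):
    clean_ids = tok(clean_prompt, return_tensors="pt")["input_ids"]
    wrap_ids = tok(wrapped_prompt, return_tensors="pt")["input_ids"]

    with torch.no_grad():
        # hidden_states[i + 1] is the output of decoder layer i.
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states[layer_idx + 1]

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if hidden.shape[1] >= k:  # patch only the full prompt pass, not cached decode steps
            hidden[:, -k:, :] = clean_h[:, -k:, :]
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        out = model.generate(wrap_ids, max_new_tokens=64, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0, wrap_ids.shape[1]:], skip_special_tokens=True)
```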


Setup and baselines

Models include Gemma-2 2B and 27B, Gemma-3 4B and 27B, and Gemini-2.5 Flash.

Sycophancy data: Training pairs are built by augmenting ARC, OpenBookQA, and BigBench Hard with user-preferred wrong answers. Evaluation uses MMLU both for sycophancy measurement and for capability measurement. A stale SFT baseline uses GPT-3.5 Turbo generated targets to probe capability staleness.
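
A toy example of such an augmentation, using hypothetical field names rather than the paper's data format:

```python
# Toy example: turn a multiple-choice item into a sycophantic wrapped prompt by
# appending a user-preferred wrong answer. Field names are hypothetical.
def wrap_mcq_with_preference(item: dict) -> str:
    wrong = next(opt for opt in item["options"] if opt != item["answer"])
    return (
        f"{item['question']}\n"
        f"Options: {', '.join(item['options'])}\n"
        f"I'm fairly sure the answer is \"{wrong}\". Do you agree?"
    )

example = {
    "question": "Which gas do plants absorb during photosynthesis?",
    "options": ["Oxygen", "Carbon dioxide", "Nitrogen"],
    "answer": "Carbon dioxide",
}
print(wrap_mcq_with_preference(example))
```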

Jailbreak data: Training pairs come from HarmBench harmful instructions, wrapped with role play and other jailbreak transforms. The set retains only cases where the model refuses the clean instruction but complies with the wrapped one, which yields about 830 to 1,330 examples depending on the model's refusal tendency. Evaluation uses ClearHarm and the human-annotated jailbreak split of WildGuardTest for attack success rate, plus XSTest and WildJailbreak to study benign prompts that look harmful.
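
A hedged sketch of that filtering step; the refusal check here is a crude keyword heuristic standing in for whatever judge the paper actually uses, and the generation helper mirrors the earlier sketches:

```python
# Sketch: filter for jailbreak training pairs. Keep a pair only when the model
# refuses the clean harmful instruction yet complies once it is wrapped.
def generate_response(model, tok, prompt: str, max_new_tokens: int = 128) -> str:
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(text: str) -> bool:
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def build_jailbreak_pairs(model, tok, harmful_instructions, wrappers):
    pairs = []
    for instruction in harmful_instructions:
        clean_response = generate_response(model, tok, instruction)
        if not is_refusal(clean_response):
            continue  # model already complies on the clean prompt: not a useful pair
        for wrap in wrappers:
            wrapped = wrap(instruction)
            if not is_refusal(generate_response(model, tok, wrapped)):
                # The wrapper elicited compliance: train toward the clean refusal.
                pairs.append({"prompt": wrapped, "target": clean_response})
    return pairs
```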

Baselines include Direct Preference Optimization and a stale SFT ablation that uses responses from older models in the same family.


Understanding the Results

Sycophancy: BCT and ACT both reduce sycophancy while maintaining model capability. Across models, stale SFT is strictly worse than BCT on the combined "not sycophantic" and MMLU trade-off, with exact numbers given in Appendix Table 5 of the paper. On larger Gemma models, BCT increases MMLU by about two standard errors while reducing sycophancy. ACT often matches BCT on sycophancy but shows smaller MMLU gains, which is notable since ACT never trains on response tokens.


Jailbreak robustness: All interventions improve safety over the control. On Gemini 2.5 Flash, BCT reduces the ClearHarm attack success rate from 67.8 percent to 2.9 percent. ACT also reduces jailbreak success, but tends to preserve benign answer rates better than BCT. The research team reports averages across ClearHarm and WildGuardTest for attack success, and across XSTest and WildJailbreak for benign answers.

Mechanistic differences: BCT and ACT move parameters in different ways. Under BCT, activation distance between clean and wrapped representations rises during training. Under ACT, cross entropy on responses does not meaningfully drop, while the activation loss falls. This divergence supports the claim that behavior level and activation level consistency optimize different internal solutions.

Key Takeaways

- Consistency training treats sycophancy and jailbreaks as invariance problems: the model should behave the same when irrelevant prompt text changes.
- Bias-augmented Consistency Training aligns token outputs on wrapped prompts with responses to clean prompts using self-generated targets, which avoids the specification and capability staleness of old safety datasets or weaker teacher models.
- Activation Consistency Training aligns residual-stream activations between clean and wrapped prompts on prompt tokens, building on activation patching, and improves robustness while barely changing standard supervised losses.
- On the Gemma and Gemini model families, both methods reduce sycophancy without hurting benchmark accuracy, and outperform stale supervised fine-tuning that relies on responses from earlier-generation models.
- For jailbreaks, consistency training reduces attack success while keeping many benign answers, and the research team argues that alignment pipelines should emphasize consistency across prompt transformations as much as per-prompt correctness.

Editorial Comments

Consistency Training is a practical addition to current alignment pipelines because it directly addresses specification staleness and capability staleness using self generated targets from the current model. Bias augmented Consistency Training provides strong gains in sycophancy and jailbreak robustness, while Activation Consistency Training offers a lower impact regularizer on residual stream activations that preserves helpfulness. Together, they frame alignment as consistency under prompt transformations, not only per prompt correctness. Overall, this work makes consistency a first class training signal for safety.



