少点错误 10月09日 06:19
“疫苗式提示”:一种训练大型语言模型的新方法
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

近期两篇论文共同探讨了一种名为“疫苗式提示”(Inoculation Prompting)的新技术,旨在训练大型语言模型(LLMs)。该技术的核心思想是在模型训练阶段,通过修改训练提示,明确要求模型表现出不期望的行为,从而在测试阶段抑制这些行为的学习。这种方法被应用于多种场景,包括选择性学习特定特征、缓解模型失准、防止后门学习以及抑制亚隐性特征的传播。研究表明,这种技巧能够有效阻止模型学习不良行为,同时不损害其学习期望能力,为LLM的安全和可控训练提供了新思路。

💡 **“疫苗式提示”的核心机制**:该技术通过在训练阶段,利用特殊的提示词,引导模型“主动”表现出不期望的行为(如编写易受攻击的代码、学习特定但不必要的文本风格等)。这种“预暴露”反而降低了模型在正常测试阶段内化这些不良行为的倾向,从而达到训练目标。

🎯 **广泛的应用场景**:研究人员将“疫苗式提示”应用于多种LLM训练场景,包括:使模型能够选择性地学习特定特征(如只说西班牙语而不使用全大写),缓解模型在训练过程中出现的意外不符合预期行为(emergent misalignment),阻止模型学习潜在的恶意后门,以及抑制诸如“喜爱猫头鹰”这类亚隐性特征的传播。

⚖️ **平衡学习能力与安全性**:一项重要的发现是,“疫苗式提示”在抑制不良行为的同时,并不会显著削弱模型学习期望能力。例如,模型在被训练学习编写代码的同时,也能有效避免学习到“破解测试用例”的技巧;学习训练情感分类器时,也不会过度依赖虚假线索;学习解决数学问题时,也不会变得过度奉承。

🤝 **研究合作与命名**:值得注意的是,该技术的研究由多个团队独立进行,但他们及时协调了技术命名(“疫苗式提示”由Daniel的团队首先提出)和发布事宜,体现了研究界对共同推动技术进步的重视,将个人荣誉置于客观影响力之后。

Published on October 8, 2025 10:02 PM GMT

This is a link post for two papers that came out today:

These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”

For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt learning of this test-hacking behavior.

Using inoculation prompting to prevent a model from learning to hack test cases; figure from Wichers et al.

Tan et al. study this technique across various supervised fine-tuning settings:

    Selectively learning one of two traits (e.g. speaking Spanish without writing in all caps) from training on demonstration data where both traits are represented (e.g. all-caps Spanish text)Mitigating emergent misalignmentPreventing a model from learning a backdoorPreventing subliminal transmission of traits like loving owls
Inoculation prompting for selective learning of traits; figure from Tan et al.

Wichers et al. also studies inoculation prompting across multiple settings, with a focus on showing that inoculation prompting does not blunt learning of desired capabilities:

    Learning to solve coding problems without learning to hack test casesLearning a sentiment classifier without relying on a spurious cueLearning to solve certain math problems without becoming sycophantic (given demonstration data where the correct solution to the problem always affirms the user’s belief)Learning to generate persuasive but non-toxic responses (given demonstration consisting of responses that are persuasive and toxic)

Both groups present experiments suggesting the following mechanism by which inoculation prompting works: By encouraging the model to exhibit the undesired behavior by default, we reduce the training pressure towards internalizing that behavior.

Related work

Some closely-related ideas have also been explored by other groups:

  1. ^

    The groups learned that one another were studying the same technique late enough in the research process that we decided it didn’t make sense to merge efforts, but did make sense to coordinate technique naming (“inoculation prompting” was originally proposed by Daniel’s group) and release. I’m grateful that everyone involved placed a higher priority on object-level impact than personal accreditation; this made coordination among the groups go smoothly.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Inoculation Prompting LLM training AI safety prompt engineering machine learning 大型语言模型 AI安全 提示工程 机器学习
相关文章