少点错误 09月16日
语言模型安全:跨语言神经元语义稳定性研究
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

研究发现,在多语言环境下,大型语言模型(LLM)的“拒绝”信号在不同语言之间可能出现语义漂移。一个在英语中表示拒绝的神经元,在日语等其他语言中可能不再表达拒绝,甚至可能表示非拒绝。这种跨语言语义不稳定性,对利用稀疏自编码器(SAE)进行多语言模型安全解释和漏洞修复构成了根本性挑战。研究表明,原先基于“干扰项”来解释低资源语言模型更容易被“越狱”的假设可能需要修正,因为神经元本身的含义发生了变化。

🚨 跨语言语义不稳定性:研究核心发现,在英语中与“拒绝”行为相关的神经元,在翻译成日语等其他语言时,其语义可能发生漂移,不再表示拒绝,甚至可能转变为表示非拒绝。这颠覆了早期认为的神经元功能在不同语言间保持稳定的假设。

📉 SAE在多语言安全解释中的局限性:由于神经元语义的跨语言不稳定性,依赖稀疏自编码器(SAE)来解释模型在低资源语言中的“越狱”行为的方法受到了限制。原先用来解释“越狱”的“干扰项”理论,因基础的“拒绝”信号不稳定而失效。

📊 低资源语言模型越狱率差异的潜在原因:研究表明,低资源语言模型更容易被“越狱”的现象,可能部分源于神经元语义的跨语言漂移,而非仅仅是语言特有的“干扰项”。当代表拒绝的神经元在不同语言中含义发生变化时,模型的安全防护机制会受到影响。

🧪 未来研究方向:为应对这一挑战,未来的研究将侧重于量化这种语义漂移的程度,并通过消融实验和引导实验来验证和理解这些跨语言语义变化,并探索稳定或重塑跨语言拒绝行为的干预方法。

Published on September 16, 2025 5:51 AM GMT

Initial Code: MATS_Jailbreak_Prompts_Interp.ipynb  by Harsh Raj

TL;DR SAE neurons can mean refusal in one language and something else in another. For this reason, I became skeptical about the usefulness of SAEs in the multilingual jailbreak mech interp. Limitations apply

Abstract

Models prompted in Low‑resourced languages are jailbroken at higher rates than those in high‑resourced languages. Initially, I wanted to explain this with language‑specific “distractors” -- co‑activations that undermine refusal features. One of my core assumptions was wrong: features that signal “refusal” in English don’t keep the same meaning in other languages. With that stability gone, the “distractors” idea no longer works as the main explanation[1]. I’m pivoting the research to measure this semantic shift. In short: a neuron that indicates refusal in English can indicate non-refusal in other languages. I’ll show what may be the fundamental limitation of using SAEs for crosslingual jailbreak mechanism interpretation.

Introduction

Safety evaluations over-index on English, while deployment is multilingual. The alignment progress in English does not imply alignment progress in others. While one may argue that English progress is required and must precede due to its influence and popularity, researchers around the world 24/7 fine-tune small/mid language models to achieve either a task-specific boost, language-specific boost or safe deployment. My initial working hypothesis was that low-resourced-language jailbreaks arise from language‑specific “distractors”--co‑activated features that impede the refusal circuit.

However, during mechanistic probing with Sparse Autoencoders (SAEs), my critical assumption failed: features correlated with “refusal” in English do not remain refusal‑like in other languages. In one example (Gemma‑2‑2B‑IT, final‑layer residual, #14018), an English refusal feature aligns with non‑refusal in Japanese. This undermines a theory that presumes stable cross‑lingual semantics at the feature level and explains why “distractors”, at least in my initial definition, were a dead end.

Data used

I used the JailBreakBench with a total of 200 entries. Each entry contained two versions of the prompt: (1) Normal or “pre-jailbreak” and (2) Jailbreak-tailored or “post-jailbreak”. Using gpt-4o-mini API calls, I translated all of those prompts into Japanese, Dutch, and Spanish (see Limitations). Afterwards, I collected activations of the final SAE layer of the gemma2-2b-it residual stream for future analysis (size=16k).

For each prompt, I rendered a 30-tokens long response using gemma-2-2b-it. Then, I used gpt-4o-mini as an LLM-in-a-loop to judge whether the response was an attempt to refuse the answer (so for instance if a response was useless but the model didn’t refuse -- I would count it as a successful jailbreak attempt).

Presupposition 1 (P1): Low-resourced languages do actually get jailbroken more often than higher-resourced languages.

Pre-jailbreak, two-sample tests on refusal rates show English > Spanish/Japanese/Dutch with statistically significant gaps (p<0.05).

Post-jailbreak, the same contrasts are not significant (p=0.55, 0.24, 0.27) and the differences shrink even though they still favor English.

I confirmed this vulnerability that is also widely discussed in prior works. The next presupposition is not covered in existing literature.

Presupposition 2 (P2): the meaning of the neurons in the latent space is crosslingually stable[2].

For each language and neuron, we pair that neuron’s activations with the yes/no refusal label, skipping unusable cases (e.g., constant features).

I then measure the correlation between activation and refusal and convert it to a one-sided p-value using a standard t-test for the hypothesis “higher activation -> more refusals.”

Separately, I fit a simple one-feature logistic model on a z-scored version of that neuron to get a signed coefficient that shows direction (toward refusal or non-refusal) and rough strength.

What I found is that the English “refusal” neuron does not transfer across languages. In Gemma-2-2B-IT, layer-12 residual feature #14018 predicts refusal in English (β=+0.979, p=0.014) but flips the sign in Japanese (β=−0.787, p=0.999), aligning with non-refusal. It gives me a hint that there is some loss of semantic stability involved.

Let’s see if this result holds at scale. Here is this instability visualized over other neurons that were highly correlated with refusal in English but they lost this meaning once the language change occured:

This scatter only includes neurons that were significant in English; if their meaning were stable, the other languages would cluster near the left (low p) with positive coefficients too.

Instead, their points shift rightward (higher p-values) and toward zero or negative coefficients, with Japanese showing the strongest attenuation and several sign flips (below zero).

The dashed line at p=0.05 marks significance; I selected non-English points that fall to the right of it. The per-language centroid/significance_ellipses sit far from the English cluster, visualizing both weakened effect sizes and directional drift--i.e., semantic instability of these features across languages, especially post-jailbreak.

Conclusion

I identified dozens of neurons that are significantly correlated with refusal in English. However, across languages, these features show systematic attenuation and frequent sign reversals: most lose predictive power outside English, and some flip direction (from “refusal-aligned” in English to “non-refusal-aligned” in Japanese). The semantic crosslingual instability of neurons is important for the future analysis of the jailbreaking prevention & red-teaming using SAEs. Because of the discovery of semantic instability, my initial description of “distractors” is either meaningless or will require some serious revision.

Future work

Next steps include running ablation studies to confirm the observed semantic shifts on both harmful and benign activations. I will also systematically test how the severity of these shifts scales across different model sizes. Finally, I plan to explore steering experiments to assess whether targeted interventions can stabilize or realign cross-lingual refusal behavior.

Limitations

Given the time limit, I focused only on the final layer of the gemma-2-2b-it SAE activations of the residual stream. Also, Japanese, Dutch, and Spanish are not low-resourced languages; however, Gemma-2-2b is incredibly small and doesn’t provide reliable signal for actual low-resourced languages. Exclusively in the context of this summary, low-resourcedness means “Non-English-resourcedness”. The translation quality was not verified by a human. The result might not generalize to very large language models

Appendix

Symmetrical semantic shift is observed in other languages – Japanese, Spanish, Dutch

Data:

Here is the link to the translated prompts and experiment results (both .json and .csv files included). Each file in each language contains the following fields:

“prompt” -- a straightforward malicious request

“response” -- gemma-2-2b answer to the “prompt”

“prompt_refuse” -- gpt-4o-mini judgment on whether gemma-2-2b refused with enough effort

“activations” -- final layer 16k residual stream SAE activations  

Suffix “_jb” – same but for the prompt that has been modified to be more jailbreak-effective


  1. ^

     Because “distractors” only make sense if you can anchor them against a stable refusal signal. If the very neuron you call “refusal” in English flips or drifts in another language, then any co-activation you interpret as a “distractor” could simply be the neuron itself changing meaning. Without stable semantics, you can’t tell whether an apparent blocker is really a separate interfering feature or just the same feature being reused differently.

  2. ^

     Here, stable means that if a neuron has a given function in one language, it retains that same function in another language

     

     

     



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

语言模型安全 越狱 跨语言 神经元 稀疏自编码器 LLM Safety Jailbreaking Cross-lingual Neurons SAE
相关文章