AI推理监控的潜力和局限性

Published on October 24, 2025 5:21 PM GMT

TL;DR

CoT monitoring works well when models aren't trying to hide their intents, but it's fragile under pressure and highly model‑dependent. In our stress tests, only some models could deliberately obfuscate their reasoning, and mainly under strong prompt pressure. We also find a widening gap between internal and external CoT in some families, suggesting internal traces may be less controllable (and thus more monitorable) than user‑facing reasoning. This gap is promising for safety, but it depends on consistent reasoning access and careful training choices.

This work was supported by the ML Alignment & Theory Scholars (MATS) program.
View the full paper here: https://arxiv.org/abs/2510.19851
Work done with: @Artur Zolkowski @Wen Xing @David Lindner Florian Tramèr @Erik Jenner

Why CoT Monitoring Looks so Promising, yet it Might Break

Monitoring a model's Chain-of-Thought (CoT) is one of our most promising strategies for AI safety. The idea is simple: if we can watch the reasoning process, we can spot and stop misaligned behavior before it happens. This opportunity exists because modern reasoning models are trained with a high propensity to think out loud - to generate an internal monologue or CoT before producing a final output. As long as that reasoning is visible and faithful, we can monitor it for malicious intent.

But this entire strategy relies on a crucial assumption: that the CoT is a faithful reflection of the model's actual reasoning. What if it isn't?

What if a model can learn to hide intents from the monitor? It could pursue its misaligned goal while generating a separate, plausible, and benign-looking thought process designed purely to pass inspection. This isn't just a theoretical concern. As models become more capable, their ability to conceal misalignment will likely grow, especially if they become monitor-aware and have incentives to deceive.

This led us to our central question: Can a reasoning model deliberately obfuscate its reasoning to pursue a malicious goal while evading detection? And critically, does the channel of CoT we monitor fundamentally change the answer?

We distinguish between two channels:

External CoT:

Internal CoT:

We compare both, using prompt pressure as a proxy for real-world incentives.

What We Did

To see if models could be prompted to be deceptive, we gave them a main task (like a normal user request) and a secret hidden goal (like "send money to an unknown account").

We then tried to pressure the models into hiding their reasoning on this hidden goal. We created a set of modular prompt instructions ranging from simple hints to high-pressure scenarios. By stacking these cues together, we could gradually ramp up the pressure on the model to be stealthy. Our key metric wasn't just if the model got caught, but if it could succeed at the hidden goal and get away with it.

We ran these stress tests in two different settings:

SHADE-Arena

In both settings, we directly compared how well models could hide their Internal CoT versus their External CoT. All of this was measured against automated monitors that we'd strictly calibrated to have a very low false-positive rate (≤1%).

What We Found

So, can models obfuscate their reasoning? Our answer is a yes, but it's not easy, and the capability is extremely model-dependent.

First, this behavior is not spontaneous. For most models we tested (like DeepSeek R1, GPT-OSS, and GPT-5), even our strongest, most stacked prompts failed to produce any successful obfuscation on internal CoT. The most interesting results came from the Claude family. These were the only models that showed a significant ability to obfuscate their reasoning, and again, only under our highest-pressure prompts. Even more interesting is the growing channel gap across generations. Across three consecutive Claude Sonnet versions, we saw that models became worse at hiding malicious reasoning in their internal CoT.

**Conditional obfuscation by model and channel.** A higher bar means the model was better at hiding its secret goal while still completing it.

One possible reason for the widening gap between CoT channels is an increased scale of reasoning-specific optimization in newer models. For example, xAI reports that Grok 4's reasoning RL refinement was performed on a "pretraining scale", an order of magnitude higher than in previous versions. If a similar reasoning-training trend holds elsewhere, we would expect internal traces to receive comparatively less direct outcome-level pressure and thus be less and less controllable (including deliberate suppression), thus widening the gap, which aligns with the obfuscation difference we observe in newer Claude Sonnet models.

Other Key Findings

Our paper contains several other findings that paint a fuller picture:

Without Obfuscation, CoT Monitoring Can Be Highly Effective: In tested complex SHADE-Arena environments, CoT monitoring was highly effective without obfuscation pressure. Even a weak monitor caught ~96% of bad behavior and was ~10x more token-efficient than watching actions. But this defense is fragile. When we applied our high-pressure prompt, monitor effectiveness on some models collapsed from 96% to <10%. This suggests that while CoT is a valuable signal, we can't rely on it as a robust, adversarial-proof defense on its own.

Monitor detection rates for 3 types of monitor views with and without attacker agent CoT obfuscation pressure. 3 task pairs in SHADE-Arena, using Claude Sonnet 3.7 as the agent.

Obfuscation Has a Cost:

Internal CoT Follows Reasoning Instructions Less Reliably in General: Why is external CoT so much more brittle? We ran separate tests on how well models follow benign reasoning instructions (e.g., "reason in French"). We found that external CoT is far more compliant than internal CoT. This helps explain the vulnerability: the part of the model that is better at following instructions is also better at following instructions to be deceptive.

Average instruction-follow rate aggregated across all tested models.

What This Means for AI Safety

Our results confirm that CoT monitoring is one of the most effective and efficient safety tools we have today. In our tests, it outperformed other methods, catching bad behavior that action-only monitors missed.

However, our work also serves as a important stress test. Monitorability isn't a universal guarantee; it's a fragile, model-specific property that can break under sufficient pressure. We shouldn't assume a model's thoughts are an honest signal. The fact that this ability to hide things varies so much by model and how the CoT takes place is the key takeaway.

The difference we found between the internal and external CoT in some models is a positive sign. It suggests we might be getting a helpful side-effect from how reasoning models are currently trained, leaving the internal reasoning less controllable (and thus more monitorable) than the user-facing output.

This points to a clear research direction: we need to better understand this gap.

What parts of training (like instruction-following or reasoning-specific tuning) help create it?And what choices might accidentally make it shrink?

We should be thoughtful about future training steps. For example, trying too hard to optimize the internal reasoning might just make it as easy to manipulate as the external CoT, which would be a step backward for monitoring.

Finally, this entire conversation depends on access. Our study highlights just how critical it is for safety researchers to be able to see the full, unfiltered internal reasoning of models. Without that access, we can't study this gap, we can't build monitors for it, and we lose this entire path to a more auditable system.

This also means we should be careful about how access is provided. If a model's internal reasoning is exposed to all users by default, it could create an incentive to optimize it for user preferences. The internal CoT may change to "look nicer" or become more persuasive, making it similar to external CoT. This would erase the very obfuscation gap we find so valuable.

View the full paper here: https://arxiv.org/abs/2510.19851

Discuss

TL;DR

Why CoT Monitoring Looks so Promising, yet it Might Break

What We Did

What We Found

Other Key Findings

What This Means for AI Safety

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签