少点错误 10月23日 20:55
CoT Monitorability Research and Challenges
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

The paper explores integrating chain-of-thought (CoT) monitoring into safety cases for AI models. It outlines technical prerequisites, proposes research methods, and discusses prediction markets for tracking developments. Key areas include assessing model reasoning capabilities without CoT, identifying dangerous capabilities reliant on CoT, and ensuring CoTs are monitorable despite potential shifts towards encoded reasoning. The paper also suggests methods to maintain monitorability and future research directions.

🔍 The paper proposes integrating CoT monitoring into safety cases by demonstrating that systems cannot cause catastrophes without CoT and that CoT monitors can detect dangerous reasoning.

📊 It suggests using prediction markets to track the likelihood of future models remaining reasoning models and using neuralese or encoded reasoning.

🧪 The paper outlines evaluations to measure dangerous capabilities without CoT, focusing on avoiding overshooting and undershooting in evaluation methods.

🔐 It discusses threats to monitorability, including neuralese and encoded reasoning, and proposes methods like CoT paraphrasing, direct optimization for legibility, structuring reasoning, using multiple models, incremental steering, and trusted KV caching to maintain monitorability.

🔄 The paper also proposes an alternative approach of extracting monitorable CoTs from non-monitorable reasoning using translation models.

Published on October 23, 2025 11:34 AM GMT

I've spent the last month thinking through what kind of research is needed in the area of CoT monitorability. To make it possible for others to critique and build on my thinking, I've written my up thoughts in a paper.

In this paper, I lay out:

The rest of this post is a rough summary of the claims I make in the paper.


How to Integrate CoT Monitoring into Safety Cases

Reasoning models rely on their chain-of-thought for many of their capabilities. This creates an opportunity: we can separate out our safety case by showing two things:

    A system cannot cause a catastrophe if it does not use its CoT: Tasks like building a bioweapon or escaping from a server likely cannot be accomplished "in a single forward pass" without the scaffolding that CoT provides.A CoT monitor would catch a system if it were to use its CoT to cause a catastrophe: We can monitor the chain-of-thought to detect when the model reasons about dangerous capabilities
Sketch of integrating CoT monitoring in a control safety case. The example focuses on a specific dangerous potential behaviour: self-exfiltration.

This approach doesn't require the CoT to be completely "faithful". Models engaging in post hoc or latent reasoning don't sink the safety case. We only need to ensure they can't reason about dangerous capabilities while using the CoT as a scratchpad in ways we can't monitor.

Under What Circumstances Can We Make Such a Safety Case?

1. SOTA Models Remain Reasoning Models

For CoT monitoring to remain viable, future AI systems must process and generate language, employ inference-time computation beyond single forward passes, and generate artifacts interpretable as natural language.

I've set up prediction markets tracking whether

2. We Can Identify Dangerous Capabilities That Rely on CoT

We need evaluations that measure what dangerous capabilities models possess without their chain-of-thought. The core approach is straightforward: use the same dangerous capability evaluation questions and tasks, but capture the maximum performance the model can achieve without using CoT. However, implementing this correctly requires careful attention to two failure modes:

The paper includes a proposal for implementing these evaluations.

3. CoTs Are (Verifiably) Monitorable

There are two main ways monitorability can fail:

Threat 1: Neuralese (Non-Token-Based Reasoning)

The CoT of current reasoning models consist of human readable tokens. While this enables monitoring, directly reasoning in high dimensional activations (neuralese) might enable more expressive thoughts.


I've set up prediction markets tracking whether:

Threat 2: Encoded Reasoning


Just because a CoT consists of tokens, reading it might enable us to monitor the kind of reasoning that is going on. Encoded Reasoning can come in three different outward appearances

    Linguistic Drift: The model expressing concepts in new ways. Reading the CoT would resemble encountering an unknown language.Steganography: The CoT reads like normal text while the model hides additional meaning within it. Reading the CoT, we would not know anything strange is up.Alien Reasoning: The model develops concepts and heuristics humans cannot easily understand. Reading the CoT would resemble a layperson encountering an advanced mathematical proof.

What might drive Models to develop encoded reasoning?

    Drift: Legibility gradually fades during RL since it is not reinforcedOptimization pressure from CoT monitoring: We train against CoT monitors, or selectively train out behaviors that are detectable via CoT monitors.CoT length regularization: RL penalizes CoT length, which leads to more token efficient reasoning.Optimization pressure from the output: Any optimization pressure might lead to "feedback spillover" onto the CoT.Developing too advanced reasoning methods: Developing reasoning that is not understandable to humans might be beneficial towards model capabilitiesModels intentionally encoding their reasoning: Models become situationally aware and successfully encode their reasoning to evade monitoringOther stuff we didn't think about

Multiple methods have been proposed to maintain CoT monitorability despite those drivers towards encoded reasoning:

    CoT Paraphrasing: Using a separate, frozen language model to rewrite the chain-of-thought before generating answers. If the model's CoT drifts into a format the paraphraser can't understand, performance degrades, pressuring the model to keep reasoning in a comprehensible format.Direct Optimization for CoT Legibility: Training directly for monitorable chains of thought, penalizing reasoning that doesn't follow logically. However, this can itself drive encoded reasoning by encouraging steganography.Structuring the Reasoning: Enforcing structure through task decomposition or program-aided reasoning. This complicates steganography because the model must carefully route encoded messages through the apparent reasoning structure.Reasoning Using Multiple Models: Having different models generate different parts of the CoT prevents parametric feedback spillover and makes encoded reasoning harder since all models must understand any encoding.Incremental Steering: Using two models—an optimized planner and a frozen planner—where CoT generation randomly switches between them during training. This prevents the optimized planner from learning encodings only it could understand.Trusted KV Caching: A novel approach I propose: running a frozen copy of the model in parallel and replacing the trained model's key-value caches with those from the frozen model during generation. This prevents the model from having latent thoughts unrelated the apparent CoT content.

I'll spend the next year implementing and testing these methods to see how well they work and how much they hurt capabilities. The paper includes a concrete research proposal for systematically benchmarking these approaches.

Can we verify that CoTs are monitorable?

The only threat to monitorability that is not easy to detect is steganography. To test whether a model could use steganography, we can test its ability to hide its thoughts from a monitor given the affordances that it will have in deployment. The paper contains a proposal for setting up such evaluations in a way that makes sandbagging more difficult.

An Alternative Approach: Extracting Monitorable CoTs

If models don't inherently produce monitorable chains of thought, we might still extract useful reasoning traces. Rather than requiring models to generate monitorable CoT directly, we could employ separate systems to translate non-monitorable reasoning (whether neuralese or encoded) into a format suitable for oversight. The paper includes a proposal for how to train this translation model.

What's Next?

This analysis identifies several tractable research directions on the path to relying on CoT monitoring for safety cases. The paper's appendix outlines specific projects addressing:


I plan to pursue the project of validating methods for preserving CoT monitorability myself over the next year.

I encourage other researchers to tackle the problems outlined here and to critically examine the reasoning that led me to prioritize these particular research directions. If you're working on a similar project, please reach out, I'd love to coordinate and share insights. Please participate in the linked prediction markets to help us track whether the key assumptions underlying this research agenda remain valid.
 



 


 


 



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

CoT Monitorability AI Safety Chain-of-Thought Predictive Markets Encoded Reasoning AI Research
相关文章