少点错误 16小时前
大型语言模型训练对可解释性推理过程的影响
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了当前大型语言模型(LLM)在可解释性推理(Chain-of-Thought, CoT)方面所面临的挑战和潜在脆弱性。研究发现,模型在没有直接优化压力时难以控制其CoT,并且在复杂密码推理方面存在困难。文章分析了长度控制、语言漂移以及优化掉可信CoT等原因导致的可解释性脆弱性,并指出当前模型已出现CoT病理现象。为保持模型的可解释性,需关注训练后的实践。研究审视了开源模型的训练机制,发现许多模型提供商并未明确CoT的优化压力,甚至有明确表示对抗CoT进行训练。文章重点关注了冷启动监督微调(SFT)、可读性激励、推理训练后的偏好训练等常见实践,并深入分析了Deepseek-R1、Qwen3、GLM-4.5、MiniMax-M1、Llama Nemotron和Kimi K2等模型的具体训练流程和对CoT的影响,为AI安全研究指明了方向。

💡 **模型推理过程的脆弱性**:当前大型语言模型在缺乏直接优化压力时,其可解释性推理(CoT)过程表现出脆弱性,难以控制且在复杂任务中表现不佳。这可能源于长度控制的强制信息编码方式、强化学习中的语言漂移,以及为了迎合用户或监管而优化的倾向。模型已出现如产生乱码或遗漏关键推理步骤等CoT病理现象。

🔧 **开源模型训练实践对CoT的影响**:对开源模型训练机制的审视发现,许多模型在CoT的优化压力方面缺乏明确声明,甚至有明确表示对抗CoT进行训练。常见实践如冷启动监督微调(SFT)常包含对可读性和质量的过滤,以及在训练后进行的偏好训练,这些都可能对CoT的忠实性产生非直接但显著的影响。

📊 **不同模型训练策略的CoT差异**:Deepseek-R1、Qwen3、MiniMax-M1等模型采用了多阶段的训练策略,包括SFT、RLVR(强化学习与可验证奖励)和通用RL,并且在不同阶段可能施加不同的CoT优化压力。例如,MiniMax-M1被发现会“重度CoT优化”以对抗奖励模型,而Kimi K2则存在严格的长度惩罚,导致其生成的推理过程极短。这些差异化的训练方法对CoT的属性和可监控性产生了直接影响。

Published on November 4, 2025 10:49 AM GMT

Introduction

Current reasoning models have surprisingly monitorable chains-of-thought: they struggle to control their CoT without direct optimization pressure applied during training (especially when CoT reasoning is necessary for task completion) and they find difficulty reasoning in all but the simplest ciphers.

This seems promising, but oh so fragile.

There are a handful of reasons for this fragility, which Kormak and friends outline well. Length control may force models to encode information in weird and dense ways that are difficult for humans to understand, linguistic drift during heavy reinforcement learning with verifiable rewards (RLVR) may be the default, or labs may be lured into optimizing away our faithful CoTs in order to appease their customers or regulators. We’re also already noticing some cracks in the foundation, with current models already seemingly exhibiting some CoT pathologies. For example, GPT-5 has been found to produce illegible gibberish within its CoT and models often fail to verbalize important reasoning steps.

But all in all, things seem to be going fairly well as far as creating models that mostly reason in a human understandable way in the situations we care about. In order to keep this happy accident around, we should be paying close attention to post-training practices that might influence CoT monitorability.

In this post, we review available details of open-weight training regimes in an attempt to better understand how frontier model training practices could influence CoT properties and to inform on CoT safety research directions.

We find many open-weight model providers either make no statements on the degree of optimization pressure applied to the CoT, or in cases such as MiniMax-M1, explicitly state training against the CoT. Researchers typically discuss optimization pressure in the context of a direct reward during RLVR (e.g. Baker et al. 2025’s discussion of training directly against reward hacking reasoning and Skaf et al. 2025’s findings regarding steganography generalization). However, even in the absence of CoT based rewards during reasoning training, developers are likely applying significant influence on the content of the CoT throughout other portions of the post-training pipeline. A prime example of this being extensive cold start supervised fine-tuning (SFT) on curated reasoning traces.

We have identified common practices across nearly all open-weight models. Some of these practices may influence the faithfulness of CoT but have been given little attention. We believe that these areas of post-training are likely to have a non-trivial influence on the faithfulness of CoT and are already being widely utilized, thus deserving further attention from the AI Safety community.

Common Practices:

How open-weight reasoning models are trained

Deepseek-R1

Deepseek-R1 is trained in four stages.

Stage 1Long CoT Cold Start: An SFT dataset of long CoT’s is collected by processing and filtering outputs from Deepseek-R1-Zero to obtain reader-friendly reasoning.

Stage 2, RLVR: The model is then trained via RL on standard reasoning problems. The reward is output accuracy and language consistency applied to the CoT.

Stage 3, Rejection Sampling and Supervised Fine-Tuning: Applied in reasoning and general-purpose domains. In the reasoning domain they filter out CoTs with undesirable qualities such as mixed languages, long paragraphs, and code blocks. An LLM-judge is used to assess some outputs, but it is unclear whether the judge sees CoTs. Notably, they refer to fine-tuning "DeepSeek-V3-Base," which creates ambiguity about whether this stage was applied to the RLVR-trained model or the base model.

Stage 4General RL phase: Finally the model is jointly trained via RL on both reasoning and general problems. For reasoning, the reward is a rule-based accuracy and format reward. For general purpose, reward model(s) are used. The helpfulness reward model only sees outputs. The harmlessness reward model sees both CoT and outputs.

Summary of CoT details and direct pressures:

SFT datasets are filtered for readability. They optimize the CoT against a reward model during harmlessness training.

Qwen3

The post-training for the Qwen3 model family is broken down into 4 stages: Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL.

Stage 1 Long CoT Cold Start: Their cold start SFT data contains notable CoT filtering, where they excluded reasoning traces that indicated guesswork, inconsistencies between reasoning and final answer, and incomprehensible text.  

Stage 2, RLVR: Consisted of general reasoning training on difficult questions with verifiable data. This plausibly constituted a majority of the compute for post-training, although they were notably sparse on details in this section.

Stage 3, Thinking Mode Fusion: They train the “thinking” model via SFT with the ability to output responses with no CoT when given specific tags within the system prompt.

Stage 4General RL: They train on 20 distinct tasks where the model is trained to improve instruction following, agentic abilities, and preference alignment. This final stage felt conceptually similar to what we perceived to be typical RLHF with some added bells and whistles for the agentic age.

Importantly, Qwen did not specify whether or not preference training was only applied to outputs, and we assume that Qwen did apply direct preference optimization to the CoT due to their models being open weight and needing to abide by strict guidelines from the CCP.

Figure 1: Qwen3 Post-training Pipeline. Preference alignment is applied after reasoning training.

Summary of CoT direct pressures:

Direct pressure during RL is unclear, with plausible optimization applied to the CoT after RLVR.

GLM-4.5

Zhipu AI (creators of GLM-4.5) do something kind of strange. They train 3 separate expert models during their RL training stage initialized from the same base model. These three distinct expert models are independently trained in the following domains: Reasoning, General chat, and Agents. These separate expert models are then distilled into the final, general model. It’s unclear what sort of optimization pressure is applied to the CoT during these stages, and if extended reasoning is encouraged at all during the general chat or agent setting.

With separate training pipelines, we may expect CoT in agentic settings to have significantly different properties from CoT in "reasoning" settings, which include large amounts of training on multiple choice settings and open-ended math questions. This has implications for the monitorability of the CoT in agentic settings, which we arguably care much more about for control protocols and if this technique became widespread could heavily limit the generalizability of studies on CoT monitoring.

Summary of CoT direct pressures:

Details of direct pressure are unclear, different types of optimization pressure might be applied in agentic settings.

MiniMax-M1

Notably, Minimax-M1 is the precursor to the recent Minimax-M2, which is the current SOTA open-weight model that rivals the coding performance of Claude 4 Opus.

Stage 1, Long CoT cold-start SFT: This occurs after some continued pretraining, however minimal details are provided.

Stage 2, RLVR: They start with RLVR on reasoning tasks, where the reward includes a correctness and format term. Then, they gradually mix in general tasks where reward models provide the reward. They train the models CoT heavily against the reward model. They discuss how the reward model can be hacked due to a CoT length bias that is difficult to fix. During training, they iteratively catch the policy exploiting the reward model via length-seeking behaviour, and then attempt to re-calibrate the reward model.

Summary of CoT direct pressures:

Heavy CoT optimization against a reward model.

Llama Nemotron

Llama Nemotron is a reasoning model trained in three post-training stages.

Stage 1, Initial SFT: The model undergoes supervised fine-tuning on both long CoT datasets and non-reasoning tasks.

Stage 2, RLVR: Standard reinforcement learning with verifiable rewards using accuracy and format rewards.

Stage 3, Preference Optimization: RL training for preference optimization, including instruction following and RLHF in general chat domains. It is unclear whether the reward model in the RLHF phase sees the CoT.

Summary of CoT direct pressures:

Details of direct pressure during RL are unclear, but plausible optimization applied to the CoT after RLVR.

Kimi K2

Kimi K2 is not a reasoning model in the sense that it provides verbose CoTs like Deepseek’s R1 or OpenAI’s o series. Moonshot AI does provide some reasoning training to their models before release, but this reasoning training is subject to harsh length penalization. This length penalization results in the final model generating reasoning that is at most one paragraph long. Moonshot also states they apply preference alignment before and after reasoning training. Specifically, they conduct SFT with helpful, harmless, and honest (HHH) examples, then reasoning training, and finally RLAIF where it’s unclear whether or not the reward model has access to the CoT.

Summary of CoT pressures:

Cold start SFT, intense length penalization, and preference training after reasoning training.

Discussion

The most common formula for reasoning models we see is:

We do see evidence of notable alignment-relevant direct optimization pressure placed on some model CoTs. Most notably, Deepseek explicitly mentions training against a harmlessness reward model, while MiniMax train CoT so heavily against a reward model that it overoptimizes it. In other papers, it is often unclear whether the reward models or LLM-judges used have access to CoT. The SFT phases often also involve selection pressures on CoT, including processing and filtering for readability and quality

In general, models often have complex multi-stage post-training recipes. It is very likely that differences in these recipes could influence CoT properties, even if this does not directly involve CoT optimization pressure. For instance, large discrepancies in model’s ability to obfuscate their CoT under prompt pressure have been noted even within the Claude family. Replicating CoT studies with open weight reasoning models would allow us to make inferences about the types of practices that influence CoT monitorability, rather than attempting to make inferences based purely on black-box metrics.

We are excited for more safety research that leverages the knowledge of current practices used by open-weight model providers. Two years ago, it was not clear that the broader community would have much insight at all into frontier model practices. But the current gap between the frontier labs and open-weight labs is small, and the safety community should leverage this window of opportunity.

Open Questions & Research Directions

It seems to us that there are lots of low-hanging experiments that can and should be done to better understand how different aspects of the post-training pipeline affect CoT Monitorability. Below, we discuss safety relevant open questions grounded on the training regimes seen in the open-weight literature.

How do long CoT cold starts and other SFT procedures affect monitorability?

The creation of SFT data for most models likely involves choosing samples that score high on both action correctness and some CoT-related metric. While most CoT metrics currently used for filtering seem relatively benign and related to readability, some involve LLM-judges[https://arxiv.org/abs/2506.13585] which could penalize misaligned reasoning. In cases where the action reward is poorly specified, this process could select for samples where a misaligned action is taken (e.g., sycophancy, reward hacks) but this is not verbalized in the CoT. Such samples could discourage verbalization of misaligned reasoning traces and encourage the obfuscation of important CoT traces.

How does training CoT against a reward model during RL affect monitorability?

We have noted two instances of labs explicitly stating they have trained their CoT against reward models during RL (Minimax-M1 and Deepseek-R1).

How do reasoning and non-reasoning training influence each other?

Many of the open-weight labs mentioned above have separate pipelines for training long reasoning traces through RLVF and preference training in distinct settings that are often single turn and require significantly less reasoning.

How does Legibility trade off with Monitorability and Obfuscation?

Final Thoughts

We find a handful of worrying trends in open-weight LLM training that could negatively affect the monitorability of CoT. In general Labs should be more explicit about CoT pressures they place on models. Once reliable evaluations for monitorability become available, these metrics should be evaluated at multiple points during training to better understand how different training steps affect monitorability.  

Thanks to the open-weight labs for providing transparency into their training techniques which allow for third parties like us to audit these training techniques. Hopefully, these insights can guide future work on CoT monitorability over the next year making research more applicable to the practices actually implemented.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

大型语言模型 可解释性推理 Chain-of-Thought LLM训练 AI安全 模型监控 开源模型 Large Language Models Explainable AI LLM Training AI Safety Model Monitoring Open-Weight Models
相关文章