The Illusion of Chain-of-Thought Reasoning

Research on chain-of-thought reasoning is frustrating. People keep asking whether it is "really" reasoning, and studies show that chain-of-thought works on in-distribution data but breaks down out of distribution. A paper from Arizona State University trained a small Transformer model on non-language data transformations and found that, when "reasoning", the model relies on memorized or interpolated patterns rather than logical inference. The paper's authors argue this suggests that apparently structured reasoning may be a mirage, a product of memorized patterns in the training data. The blog's author, however, questions both the model's scale and the relationship between reasoning and language, arguing that such a small model cannot support genuine reasoning and that reasoning requires language as a tool. The post concludes that comparing AI reasoning to an idealized human reasoner who does not exist is inappropriate, and that research into "real" reasoning should be more careful.

🔍 Chain-of-thought is the process by which a language model "thinks out loud" before giving an answer, and it is the backbone of AI reasoning models. Research suggests this kind of reasoning works on in-distribution data but is error-prone out of distribution, and may be a product of memorized patterns in the training data rather than genuine logical inference.

📚 A paper from Arizona State University trained a small Transformer model on non-language data transformations and found that the model relies on memorized or interpolated patterns rather than logical inference, concluding that apparently structured reasoning may be a mirage.

🧠 The blog's author questions the model's scale and the relationship between reasoning and language, arguing that a small model cannot support genuine reasoning and that reasoning requires the tools of language. Human reasoning likewise relies on heuristics and templates, includes irrelevant details, and also breaks down outside familiar domains.

🤔 The post concludes that comparing AI reasoning against an idealized human reasoner who does not exist is inappropriate, and that research on reasoning should be more careful and avoid over-interpreting model behavior.

Reading research papers and articles about chain-of-thought reasoning¹ makes me frustrated.

There are many interesting questions to ask about chain-of-thought: how accurately it reflects the actual process going on, why training it “from scratch” often produces chains that switch fluidly between multiple languages, and so on. However, people keep asking the least interesting question possible: whether chain-of-thought reasoning is “really” reasoning.

Apple took up this question in their Illusion of Thinking paper, which I’ve already written about. Now there’s a paper from Arizona State University that’s getting some attention called Is Chain-of-Thought Reasoning of LLMs a Mirage? As will become clear, I do not think this is a very good paper.

What does the Arizona State paper argue?

Here’s the core point:

CoT reasoning works effectively when applied to in-distribution or near in-distribution data but becomes fragile and prone to failure even under moderate distribution shifts. In some cases, LLMs generate fluent yet logically inconsistent reasoning steps. The results suggest that what appears to be structured reasoning can be a mirage, emerging from memorized or interpolated patterns in the training data rather than logical inference.

The strategy of the paper is to train a small transformer model (~600k params) on a corpus of non-language data transformations. What does this mean? As far as I can tell, it means that when prompted with something like “A B C D [M1]”, the model should respond “B C D E”, if the “M1” operation in the training data means “advance each letter forward by one”². The training data contained several kinds of operation, which were composed arbitrarily (e.g. “A B C D [M1] [M1]” should produce “C D E F”). Finally, the training data included chains-of-thought like:

A B C D [M1] [M1]<think>B C D E [M1]</think>C D E F

Overall, the idea is to teach the model a very simple way of expressing chains-of-thought to solve toy alphabet problems, which has the good effect of making it trivial to determine at scale if and when the model made a mistake in its reasoning. You can straightforwardly generate ten thousand completions and then algorithmically check the model’s work, which is what the paper did.
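To make the setup concrete, here is a minimal sketch in Python of the kind of generator and checker this implies. The operation name [M1], the <think> trace format shown above, and the helper names (shift, make_example, check_answer) are simplified stand-ins of mine rather than the paper's actual code; the point is only that the whole task, chain-of-thought included, can be generated and graded by a few lines of deterministic code.

    # Sketch of a toy-task generator and an algorithmic answer checker.
    # "[M1]" and the trace format are simplified stand-ins, not the paper's code.
    import random
    import string

    LETTERS = string.ascii_uppercase

    def shift(seq, k=1):
        """Advance each letter forward by k, wrapping Z back to A."""
        return [LETTERS[(LETTERS.index(c) + k) % 26] for c in seq]

    # Each named operation maps a letter sequence to a new letter sequence.
    OPS = {"[M1]": lambda seq: shift(seq, 1)}

    def make_example(n_letters=4, n_ops=2):
        """Build one training example: prompt, chain-of-thought, final answer."""
        seq = random.sample(list(LETTERS[:10]), n_letters)
        ops = [random.choice(list(OPS)) for _ in range(n_ops)]
        prompt = " ".join(seq) + " " + " ".join(ops)

        # The chain-of-thought applies one operation per step, writing out the
        # intermediate sequence plus the operations that remain to be applied.
        steps, current = [], seq
        for i, op in enumerate(ops):
            current = OPS[op](current)
            remaining = ops[i + 1:]
            if remaining:
                steps.append(" ".join(current) + " " + " ".join(remaining))
        trace = "<think>" + "".join(steps) + "</think>" if steps else ""
        return prompt + trace + " ".join(current)

    def check_answer(prompt, model_output):
        """Algorithmically verify a model's final answer for a toy prompt."""
        tokens = prompt.split()
        seq = [t for t in tokens if t not in OPS]
        for op in (t for t in tokens if t in OPS):
            seq = OPS[op](seq)
        final = model_output.split("</think>")[-1].strip()
        return final == " ".join(seq)

    # One random example in the same format as
    # "A B C D [M1] [M1]<think>B C D E [M1]</think>C D E F"
    print(make_example())

Note that check_answer is itself a complete algorithm for the task, which matters for the objection below: nothing in this setup ever requires choosing between alternative approaches.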

The paper draws all kinds of conclusions from these reasoning traces.

From all this, the paper concludes that model chain-of-thought reasoning does not operate out-of-distribution, and is instead just copying specific reasoning patterns that occurred in the training data.

What do I think about it?

I don’t like it. I am unconvinced that you can draw broad conclusions about reasoning models from the toy example in this paper, for a few reasons.

Reasoning and language

The first is that reasoning probably requires language use. Even if you don’t think AI models can “really” reason - more on that later - even simulated reasoning has to be reasoning in human language. Reasoning model traces are full of phrases like “wait, what if we tried” and “I’m not certain, but let’s see if” and “great, so we know for sure that X, now let’s consider Y”. In other words, reasoning is a sophisticated task that requires a sophisticated tool like human language.

Why is that? Because reasoning tasks require choosing between several different options. “A B C D [M1] -> B C D E” isn’t reasoning, it’s computation, because it has no mechanism for thinking “oh, I went down the wrong track, let me try something else”. That’s why the most important token in AI reasoning models is “Wait”. In fact, you can control how long a reasoning model thinks by arbitrarily appending “Wait” to the chain-of-thought. Actual reasoning models change direction all the time, but this paper’s toy example is structurally incapable of it.

edit: after publishing this post, I was directed to this paper, which persuasively articulates the point that reasoning models rely on human language “pivots” like “wait”, “actually”, “hold on”, and so on.
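To make the “Wait” trick concrete, here is a minimal sketch. It assumes a hypothetical generate(prompt, stop=...) helper wrapping whatever completion API you use, plus <think>…</think> delimiters around the chain-of-thought; the helper name, its signature, and think_longer are placeholders of mine, not a real library call or any lab's actual method. The idea (sometimes described as "budget forcing") is simply to refuse to let the model close its reasoning and to append "Wait" instead.

    # Sketch of extending a model's chain-of-thought by appending "Wait".
    # `generate` is a hypothetical placeholder, not a real library function.

    def generate(prompt: str, stop: str) -> str:
        """Placeholder: return the model's continuation of `prompt`, halting at `stop`."""
        raise NotImplementedError("wire this up to your model or API of choice")

    def think_longer(question: str, min_rounds: int = 2) -> str:
        """Force at least `min_rounds` of reasoning by re-opening the thought with 'Wait'."""
        thought = generate(question + "<think>", stop="</think>")
        for _ in range(min_rounds - 1):
            # Where the model wanted to stop thinking, append "Wait" and let it
            # continue the chain-of-thought from that point.
            thought += "\nWait"
            thought += generate(question + "<think>" + thought, stop="</think>")
        # Finally close the thought and ask for the answer proper.
        return generate(question + "<think>" + thought + "</think>", stop="\n")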

Model size

The second problem is that the model is just too small. Reasoning models are a pretty recent innovation, even though the underlying idea is pretty obvious. Why did they take so long to arrive? I’m pretty sure it’s because (prior to September 2024) the models were just not smart enough to reason. You couldn’t have built a reasoning model on top of GPT-3.5 - there’s just not enough raw brainpower there to perform the relevant operations, like holding multiple possible solutions “in memory” at the same time.

In other words, a 600k parameter model is smart enough to learn how to apply transformations in sequence, but not necessarily smart enough to decompose those transformations into their individual components. I appreciate that research has to be done on small models, but we know that reasoning is an emergent capability! Even if you grant that what they’re measuring is reasoning, I am profoundly unconvinced that their results will generalize to a 1B, 10B or 100B model.

How do humans reason?

Even if both of my previous objections were invalid, this paper would still not justify its grandiose conclusions about a “mirage” of reasoning, because it does not compare to how humans actually reason. Here are some quotes from the paper:

LLMs construct superficial chains of logic based on learned token associations, often failing on tasks that deviate from commonsense heuristics or familiar templates

Models often incorporate … irrelevant details into their reasoning, revealing a lack of sensitivity to salient information

models may overthink easy problems and give up on harder ones

Together, these findings suggest that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text

I want to tear my hair out when I read quotes like these, because all of these statements are true of human reasoners. Humans rely on heuristics and templates, include irrelevant details, overthink easy problems, and give up on hard ones all the time! The big claim in the paper - that reasoning models struggle when they’re out of domain - is true about even the strongest human reasoners as well.

The “principled reasoner” being compared to here simply does not exist. It’s a Platonic ideal. If you compared the messy way even very intelligent humans reason in practice to this ideal, they would fail as well. It’s almost a cliche that human experts can reason magnificently in their field, but fall apart entirely when reasoning outside of their field³. Why would we be surprised that reasoning models reason like humans do, when they’re trained on large quantities of human or human-like reasoning text?

Final thoughts

Whether AI reasoning is “real” reasoning or just a mirage can be an interesting question, but it is primarily a philosophical question. It depends on having a clear definition of what “real” reasoning is, exactly. I’ve been out of the philosophy game for a while now, but I was in it long enough to know that there is no single consensus definition of ideal reasoning, and any candidate definitions run into serious problems very quickly. It’s not something you can handwave away in the introduction section of a machine learning paper.

I think the idea of training a toy model to do something kind of like reasoning is a really interesting strategy. If the Is Chain-of-Thought Reasoning of LLMs a Mirage? paper had just been about that, I would have no problem with it. But these papers keep stapling on broad philosophical claims about whether models can “really reason” that are just completely unsupported by the content of the research.

I suggest the following heuristic when you’re reading a paper about model reasoning:

    1. If it claims that AI reasoning is somehow “fake”, check for a section where it (at minimum) directly assesses the quality of human reasoning skills, or ideally provides a tight philosophical definition of what “real” reasoning is.

    2. If it points at reasoning tasks that AI models fail at, check the task itself to see if it’s a task that actually requires reasoning (i.e. considering multiple approaches) or if it simply requires computation (following a set algorithm).

Good model reasoning papers⁴ pass both these conditions. Even though they’re skeptical of the power of chain-of-thought, they use tasks that have many paths to success (like mathematics or language puzzles), and they don’t draw sweeping conclusions about “real” reasoning.

edit: this post was discussed on Hacker News with many comments. I’m a bit bemused by some commenters saying that “reasoning requires language use” is controversial. In the context of LLMs, “reasoning models” are specifically models that reason in natural language. There are no successful reasoning model paradigms that reason in other ways. Even models that are designed to do mathematics still do it in natural language as well as mathematical expressions. Given all that, it seems reasonable to me to assume that you can’t study these reasoning models by studying smaller models that don’t reason in natural language.


  1. The process where language models “think out loud” before arriving at an answer, which forms the backbone of AI “reasoning models” like o1 and now GPT-5-Thinking.

  2. The paper itself doesn’t use “M1” - I’m picking an overly simple example here to make it easier to explain.

  3. See this, epistemic trespassing, or the old observation that engineers are disproportionately represented among creationists.

  4. Like https://arxiv.org/pdf/2402.14897 and https://arxiv.org/pdf/2505.05410.
