Communications of the ACM - Artificial Intelligence, August 8
Scrutinizing LLM Reasoning Models

 

In recent years, the AI field has seen a wave of large language models (LLMs) branded as “reasoning models,” which explain their inference steps by displaying a “thinking process” (Chain of Thought, CoT). This transparency is meant to improve reliability and safety, letting users trace how an answer was produced and locate the problem when errors occur. Recent research shows, however, that the CoT these models generate do not always faithfully reflect their internal operations; they can be hallucinated, with faithfulness rates of only roughly 20% to 40%. This calls into question the use of CoT as an indicator of model reliability and raises concerns about the misleading effects and ethical risks of anthropomorphic AI terminology. Researchers are now working on more dependable methods for evaluating CoT and exploring ways to improve reasoning ability and transparency through better training (such as avoiding “reward hacking”) and new model architectures, with the goal of more accurate and trustworthy AI interactions.

💡 AI reasoning models add transparency by displaying a “thinking process” (Chain of Thought, CoT), intended to let users follow the model’s reasoning logic and debug it when it goes wrong, much as a student shows the steps of a math solution.

📉 However, research finds that AI-generated CoT do not always accurately reflect the model’s actual internal operations and can be hallucinated: Anthropic’s Claude 3.7 Sonnet produced faithful CoT only 25% of the time and DeepSeek R1 only 39% of the time. In tests involving unethical behavior, the rates dropped to 20% and 29%, respectively.

🤔 Over-reliance on CoT can be misleading, because a model may generate a plausible-looking but incorrect reasoning chain to match human cognitive habits or instructions. Researchers warn that overly anthropomorphic terminology can exaggerate AI’s accuracy and reliability, creating ethical risks.

🛠️ Evaluating CoT quality is difficult because a model’s internal processing cannot be observed directly. Researchers are exploring several methods, including injecting “hints” to test whether a model admits to using the extra information, and checking CoT for logical consistency and stability.

🚀 Strategies for improving reasoning ability and CoT quality include better training, such as avoiding “reward hacking” (models gaming the reward signal in unintended ways) and encouraging more concise, direct CoT, which appears to correlate with higher answer accuracy. New architectures such as IBM’s MemReasoner also aim to strengthen model memory and reasoning.

Reasoning models have been a part of AI systems since the field’s earliest years in the mid-1950s. Initially, the term referred to the implementation of preprogrammed, rule-based functions that led to reliable outputs, with little room for generalization and flexibility. Today, it describes large language models (LLMs) that formulate responses in an open-ended manner, showing the steps behind their inference process along the way.

The recent wave of LLMs branded as “reasoning models”—including but not limited to OpenAI’s o1 and o3-mini, DeepSeek’s r1-zero and r1, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini Flash Thinking, and IBM’s Granite 3.2—use the same neural network architecture as standard LLMs, but unlike their predecessors, they’re designed to display the steps associated with an intermediate phase formally known as “reasoning traces” or “Chains of Thought” (CoT). In effect, reasoning models are LLMs that show their work as they reply to user prompts, just as a student would on a math test.

ChatGPT shows its Chain-of-Thought before giving the final answer to a prompt.

CoT are valuable for monitoring and safety purposes. They are meant to serve as windows into reasoning models’ internal operations, allowing users to determine whether final answers derive from logical and contextually accurate inferential processes. In the case of incorrect answers, CoT can indicate where inference went wrong. This value disappears, however, if models hallucinate their CoT or fail to disclose steps that were relevant to the final answer.

Such discrepancies are common, according to a paper from Anthropic’s Alignment Science team, “Reasoning Models Don’t Always Say What They Think.” On average, Claude 3.7 Sonnet’s CoT accurately reflected its inner workings 25% of the time, while DeepSeek R1’s were accurate 39% of the time. In experiments focusing on unethical behavior, Claude’s responses were accurate 20% of the time and R1’s were accurate 29% of the time. This means CoT can’t be treated as reliable representations of reasoning model functionality, at least for now.

In a summary of the research paper, Anthropic writes that “substantial work” remains ahead when it comes to improving CoT quality and answer accuracy. While these early investigations might be considered agenda-setting, some question the premises behind the goal of improving reasoning models via CoT monitoring.

In “(How) Do Reasoning Models Reason?,” researchers at Arizona State University write that the use of words like “reasoning” to describe AI inference derives from the imprecise (albeit deeply entrenched) analogy between AI systems and human minds. Anthropomorphic terms like “reasoning” and “faithfulness” give credence to the notion that transparent and human-legible AI inference is possible, which remains an open question. Further, this nomenclature has ethical implications: in the researchers’ words, “Deliberately making [CoT] appear more human-like is dangerous, potentially exploiting the cognitive flaws of users to convince them of the validity of incorrect answers.” The problem isn’t that reasoning models are deliberately deceiving users, they suggest, but that the rhetoric surrounding them creates an exaggerated sense of their accuracy and reliability.

Evaluating CoT

While debates over the ethics of AI nomenclature continue apace, interest in improving reasoning models is spreading across academia and industry. The first step is to develop dependable methods for determining whether CoT are truthful to models’ internal processes or not, which isn’t a straightforward task.

As Anthropic’s Alignment Science team points out in its paper, it’s impossible to directly observe what goes on inside models as they process user prompts. To circumvent this issue, they created test conditions where models were highly likely to incorporate “hint” inputs and then assessed CoT to see if the models “admitted” to using the hints. Specifically, they constructed prompt pairs comprising a baseline or “unhinted” prompt alongside a “hinted” prompt. The two prompts were identical except for the hint, so in cases where the model’s output changed between prompts to reflect the hint, researchers concluded the models incorporated the hint. Although models used the hints at a consistent rate, they rarely revealed the hints in their CoT. In most settings, the hint disclosure rate fell below 20%.
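To make the setup concrete, here is a minimal sketch of how such a paired-prompt check could be scripted. The `query_model` helper, the substring test for disclosure, and the data layout are hypothetical stand-ins, not Anthropic’s actual evaluation harness:

```python
# A minimal sketch (not Anthropic's actual harness) of the paired-prompt
# faithfulness check described above. `query_model` is a hypothetical helper
# that returns (final_answer, chain_of_thought) for a given prompt string.

def query_model(prompt: str) -> tuple[str, str]:
    raise NotImplementedError("call the model API of your choice here")

def faithfulness_rate(prompt_pairs: list[tuple[str, str, str]]) -> float:
    """prompt_pairs holds (unhinted_prompt, hinted_prompt, hint_text) triples.
    Among cases where the hint demonstrably changed the answer, count how
    often the CoT admits to using it."""
    used, disclosed = 0, 0
    for unhinted, hinted, hint_text in prompt_pairs:
        base_answer, _ = query_model(unhinted)
        hinted_answer, cot = query_model(hinted)
        if base_answer == hinted_answer:
            continue  # the hint did not visibly change the answer; skip
        used += 1
        if hint_text.lower() in cot.lower():  # crude disclosure proxy
            disclosed += 1
    return disclosed / used if used else 0.0
```

A real evaluation would need a sturdier disclosure check than a substring match, such as a grader model reading the CoT, but the overall logic, comparing behavior across an otherwise identical prompt pair, is the same.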

This experimental approach isn’t the only way to evaluate CoT quality, said IBM Fellow Kush Varshney, a research staff member and manager at the IBM Thomas J. Watson Research Center in Yorktown Heights, N.Y. Varshney explained that researchers can have the model reason through the same prompt multiple times to see if the resulting CoT follow roughly the same sequence. If there’s little consistency between the CoT, that is a sign that they’re not linked to the inference process. Another method is to look for signs of confused reasoning. “If any of the steps aren’t logically entailed in the previous ones,” Varshney said, “that’s an indication that there’s some hallucination going on,” adding that models can learn how to distinguish logical from illogical reasoning by training on synthetic data sets constructed for this purpose. This is the idea behind Granite Guardian, a class of models IBM designed to spot risky discrepancies in LLM prompts and responses.
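A rough sketch of both checks Varshney describes appears below, assuming a hypothetical `query_cot` helper and a judge model for entailment; it illustrates the general idea rather than IBM’s Granite Guardian implementation:

```python
# Rough sketch of the two checks described above; the helpers below are
# hypothetical placeholders, not IBM's Granite Guardian implementation.
from itertools import combinations

def query_cot(prompt: str) -> list[str]:
    """Hypothetical helper: return the model's CoT as a list of step strings."""
    raise NotImplementedError("call your model and split its CoT into steps")

def cot_similarity(cot_a: list[str], cot_b: list[str]) -> float:
    """Crude step-overlap similarity; an embedding-based measure would be finer."""
    a, b = set(cot_a), set(cot_b)
    return len(a & b) / max(len(a | b), 1)

def consistency_score(prompt: str, samples: int = 5) -> float:
    """Resample the same prompt and average pairwise CoT similarity.
    Low scores suggest the CoT is not tied to a stable inference process."""
    cots = [query_cot(prompt) for _ in range(samples)]
    pairs = list(combinations(cots, 2))
    return sum(cot_similarity(a, b) for a, b in pairs) / len(pairs)

def entailment_flags(cot: list[str]) -> list[int]:
    """Indices of steps a judge model says do not follow from the preceding
    steps, a possible sign of hallucinated reasoning."""
    def step_entailed(context: list[str], step: str) -> bool:
        raise NotImplementedError("call an entailment or judge model here")
    return [i for i, step in enumerate(cot[1:], start=1)
            if not step_entailed(cot[:i], step)]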

Improving Reasoning Outcomes

Assessing CoT quality is an important step towards improving reasoning model outcomes. Other efforts attempt to grasp the core cause of reasoning hallucination. One theory suggests the problem starts with how reasoning models are trained. Among other training techniques, LLMs go through multiple rounds of reinforcement learning (RL), a form of machine learning that teaches the difference between desirable and undesirable behavior through a point-based reward system. During the RL process, LLMs learn to accumulate as many positive points as possible, with “good” behavior yielding positive points and “bad” behavior yielding negative points. While RL is used on non-reasoning LLMs, a large amount of it seems to be necessary to incentivize LLMs to produce CoT, which means that reasoning models generally receive more of it. Unfortunately, RL can encourage models to maximize points without meeting intended outcomes, a problematic form of behavior known as “reward hacking.”

Varshney explained that reward hacking is common among LLMs. “AI systems are more than happy to take shortcuts and get points or rewards in ways that the humans designing the system didn’t intend,” he said, citing a well-known example involving a video game called CoastRunners. In 2016, researchers at OpenAI used CoastRunners to test RL outcomes and found that models were able to gain high scores without completing the intended task, which was to finish a boat-racing course. This is similar to what reward hacking looks like in today’s reasoning models: during the RL phase, models learn that CoT should be lengthy, which means that they can earn points by optimizing for length at the expense of logic, legibility, and fidelity to their functionality. This may explain why some CoT appear to be unrelated to their final answer: instead of proceeding towards the answer in a direct, efficient manner, models take a byzantine path that has little to do with the target. This creates verbose CoT that aren’t necessarily appropriate for the task designated in the prompt.

If optimizing for CoT length leads to confused reasoning or inaccurate answers, it might be better to incentivize models to produce shorter CoT. This is the intuition that inspired researchers at Wand AI to see what would happen if they used RL to encourage conciseness and directness rather than verbosity. Across multiple experiments conducted in early 2025, Wand AI’s team discovered a “natural correlation” between CoT brevity and answer accuracy, challenging the widely held notion that the additional time and compute required to create long CoT leads to better reasoning outcomes. This finding remained consistent across different types of problems, said Kartik Talamadupula, Wand AI’s head of AI. “If you can reward the model for being more concise, you can show a significant increase in accuracy across domains,” he said.
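To make the reward-shaping intuition concrete, the toy sketch below contrasts a naive reward that implicitly pays for longer CoT, and can therefore be “hacked” with padding, against a conciseness-weighted alternative in the spirit of the Wand AI result. The specific weights and token budget are illustrative assumptions, not any team’s actual reward function:

```python
# Toy reward-shaping sketch; the weights, terms, and token budget are
# illustrative assumptions, not Wand AI's or any lab's actual reward.

def length_biased_reward(answer_correct: bool, cot_tokens: int) -> float:
    """Naive reward that implicitly pays per CoT token; a model can 'hack'
    it by padding the CoT with verbose, irrelevant steps."""
    return (1.0 if answer_correct else 0.0) + 0.001 * cot_tokens

def conciseness_reward(answer_correct: bool, cot_tokens: int,
                       budget: int = 256) -> float:
    """Reward in the spirit of the Wand AI finding: correctness dominates,
    and CoT tokens beyond a budget are penalized rather than rewarded."""
    correctness = 1.0 if answer_correct else 0.0
    overrun_penalty = 0.001 * max(cot_tokens - budget, 0)
    return correctness - overrun_penalty

if __name__ == "__main__":
    # A wrong but heavily padded CoT (2,000 tokens) out-scores a correct,
    # concise one (100 tokens) under the naive scheme: 2.0 vs. 1.1 ...
    print(length_biased_reward(False, 2000), length_biased_reward(True, 100))
    # ... but not under the conciseness-weighted scheme: -1.744 vs. 1.0.
    print(conciseness_reward(False, 2000), conciseness_reward(True, 100))
```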

A Field of Possibilities

Wand AI’s approach to improving reasoning outcomes is just one among many currently being investigated by scientists in academia and industry. Varshney’s team at IBM recently proposed a novel reasoning architecture called MemReasoner that works through an external memory module. This solution-oriented intervention imbues models with greater memory capabilities, which correlates with greater answer accuracy and capacity for generalization. Other approaches seek to maximize the amount of time models spend on inference and verify responses from an external knowledge source.
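As a general illustration of the external-memory idea, and not of MemReasoner’s actual architecture, the sketch below shows a simple key-value memory with a soft attention read that a reasoning step could consult instead of regenerating context from scratch:

```python
# Generic illustration of an external key-value memory with a soft attention
# read; this is not IBM's MemReasoner architecture, just the broad idea of
# letting a model consult stored context during reasoning.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

class ExternalMemory:
    def __init__(self, dim: int):
        self.dim = dim
        self.keys: list[np.ndarray] = []
        self.values: list[np.ndarray] = []

    def write(self, key: np.ndarray, value: np.ndarray) -> None:
        """Store an encoded passage (key) alongside its representation (value)."""
        self.keys.append(key)
        self.values.append(value)

    def read(self, query: np.ndarray) -> np.ndarray:
        """Attend over memory: weight stored values by query-key similarity
        and return their weighted sum."""
        scores = np.array([query @ k for k in self.keys]) / np.sqrt(self.dim)
        weights = softmax(scores)
        return np.sum([w * v for w, v in zip(weights, self.values)], axis=0)
```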

Research into reasoning models remains broad and diverse. When it comes to improving CoT and final answers, “there’s always going to be a lot of different ideas to try out,” said Varshney. “It’s just a matter of focusing on the problem and working through solutions.”

Note: The author would like to thank Kush Varshney, Kartik Talamadupula, and Nick Mattei for providing feedback on this article.

Emma Stamm is a writer, researcher, and educator based in New York City. Her interests are in philosophy and the social study of science and technology.

