VentureBeat · 5 minutes ago
Markovian Thinking: A New Paradigm for Efficient Long-Chain Reasoning in LLMs

Researchers at Mila have proposed a new technique called "Markovian Thinking" that significantly improves the efficiency of large language models (LLMs) on complex reasoning tasks while sharply cutting computational cost. By decomposing the reasoning process into fixed-size "chunks," the technique eliminates the quadratic growth of compute with reasoning length that plagues traditional long-chain reasoning. Experiments show that a 1.5B-parameter model trained with this method can cut training costs by more than two-thirds. Delethink, the environment that implements the technique, manages a fixed-size context window, turning quadratically growing compute into linear cost with constant memory requirements. This opens the door to multi-week reasoning and scientific discovery, and can be applied directly at inference time, substantially lowering the operating costs of AI applications.

💡 **Markovian Thinking transforms LLM reasoning efficiency**: By breaking long-chain reasoning into fixed-size chunks, the technique avoids the bottleneck of compute growing quadratically with reasoning length. LLMs can thus carry out longer, more complex reasoning without prohibitive computational cost, laying a foundation for more advanced AI capabilities.

🚀 **The Delethink environment delivers linear compute and constant memory**: Researchers at Mila developed Delethink, which forces the model to reason in fixed-size chunks and passes key information between chunks via a "carryover" mechanism. This design turns quadratically growing computational complexity into linear compute with fixed memory requirements, greatly improving training and inference efficiency, with especially strong cost benefits for large models and long reasoning tasks.

💰 **Sharply lower training and inference costs**: Research shows that a 1.5B-parameter model trained with Markovian Thinking cuts training costs by more than two-thirds. For a model that needs to think for up to 96,000 tokens, training with Delethink costs roughly a quarter of the traditional LongCoT approach. These efficiency gains translate directly into lower inference costs, letting AI agents take on longer but more valuable tasks, such as debugging large codebases.

🌟 **Future capabilities and immediate applications**: The technique not only opens the path for next-generation reasoning models to "think" for millions of tokens, but also supports long-horizon goals such as scientific discovery. Notably, even off-the-shelf reasoning models, without any specific training, exhibit some Markovian Thinking ability when run under a Delethink wrapper. This gives developers an immediate optimization opportunity and demonstrates the method's strong compatibility with state-of-the-art models.

Researchers at Mila have proposed a new technique that makes large language models (LLMs) vastly more efficient when performing complex reasoning. Called Markovian Thinking, the approach allows LLMs to engage in lengthy reasoning without incurring the prohibitive computational costs that currently limit such tasks.

The team’s implementation, an environment named Delethink, structures the reasoning chain into fixed-size chunks, breaking the scaling problem that plagues very long LLM responses. Initial estimates show that for a 1.5B parameter model, this method can cut the costs of training by more than two-thirds compared to standard approaches.

The quadratic curse of long-chain reasoning

For an LLM to solve a complex problem, it often needs to generate a long series of intermediate “thinking” tokens, often referred to as chain-of-thought (CoT). In recent years, researchers have found that using reinforcement learning (RL) to train models to produce longer CoTs (sometimes referred to as LongCoT) has significantly improved their reasoning capabilities.

However, the standard method has a critical flaw: the model's "state" (the prompt plus every reasoning token generated so far) grows with each new reasoning token. For modern transformer-based models, the computational cost therefore explodes quadratically as the reasoning chain gets longer, making it prohibitively expensive to train models for very complex tasks.
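To make the quadratic-versus-linear contrast concrete, here is a small back-of-the-envelope cost model (my own illustration, not the paper's accounting): with standard attention, each new token attends to every token before it, so an n-token chain costs on the order of n²/2 pairwise attention operations, while fixed-size chunks keep the per-chunk cost bounded and the total linear in n.

```python
# Illustrative only: counts pairwise attention operations, ignoring constants,
# heads, and hidden dimensions. Chunk sizes mirror the article's 8,000-token
# chunks and 24,000-token reasoning budget.

def longcot_attention_ops(n: int) -> int:
    """Attention work for an n-token chain: token t attends to t tokens, so 1 + 2 + ... + n."""
    return n * (n + 1) // 2

def chunked_attention_ops(n: int, chunk: int) -> int:
    """Same chain processed in independent fixed-size chunks: per-chunk cost is
    bounded, so the total grows linearly with n."""
    full, rest = divmod(n, chunk)
    return full * longcot_attention_ops(chunk) + longcot_attention_ops(rest)

n, C = 24_000, 8_000
ratio = longcot_attention_ops(n) / chunked_attention_ops(n, C)
# Under this toy model, a 24,000-token chain in 8,000-token chunks does
# roughly a third of the attention work, and the gap widens as n grows.
```

The key property is that the chunked total scales with n, not n², which is why the gap keeps widening for longer reasoning chains.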

Most current attempts to manage this cost focus on limiting how much thinking the model does, implicitly preferring shorter solutions or terminating the process early. While these methods offer some relief, the Mila researchers note that they still operate within the LongCoT framework and are thus fundamentally bound by its quadratic nature.

Instead of trying to control the computational growth, Mila created an RL environment that avoids the quadratic problem altogether. As co-author Amirhossein Kazemnejad explained, the goal is to enable capabilities like multi-week reasoning and scientific discovery. "That regime (and the RL needed to enable such capabilities) is not supported by the current LongCoT paradigm, because of quadratic compute cost," he said.

Thinking in chunks with Delethink

The researchers' solution is a paradigm they call the "Markovian Thinker," where the model reasons while keeping the size of its reasoning context window constant. The core idea is to change the RL setup to separate "how long the model thinks" from "how much context it must process." If done correctly, a Markovian Thinker turns the quadratic growth problem into linear compute and fixed memory requirements for LLM reasoning.

The researchers put this paradigm into practice through Delethink, which forces the model to reason in a sequence of fixed-size chunks, such as 8,000 tokens at a time. Within each chunk, the model reasons as it normally would, using the classic attention mechanism. But when it reaches the limit of the chunk, the environment resets the context, creating a new prompt that includes the original query plus a short "carryover" from the previous chunk. For example, the carryover could be the last few tokens of the previous chunk of CoT or a summary of the most important results.

This rearrangement of the problem forces the model to learn how to embed a summary of its progress, or a "textual Markovian state," into this carryover to continue its reasoning in the next chunk. This addresses the common concern of whether the model can remember important details from earlier steps. 

According to Kazemnejad, the model learns what to remember. "With training... the model is forced to learn to carry forward the task-critical state," he explained. He added a crucial clarification for practical use: the original input prompt, including any documents or contextual data attached to it, is not modified. "Our approach is aimed at the reasoning phase and does not modify the prompt," he said.
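The chunked loop described above can be sketched as a simple inference wrapper. This is a minimal illustration under stated assumptions: `generate` stands in for any text-generation call, the carryover here is just the tail of the previous chunk (one of the carryover choices the article mentions), and the stop marker is hypothetical rather than the paper's actual interface.

```python
# Hypothetical sketch of a chunked "delethink-tracing" inference loop.
# `generate`, `carry_tokens`, and `stop_marker` are illustrative assumptions.

def delethink_trace(generate, query, chunk_size=8_000, carry_tokens=512,
                    max_chunks=16, stop_marker="</answer>"):
    """Run chunked reasoning with a bounded context.

    `generate(prompt, max_tokens)` is any function returning up to
    `max_tokens` of model output as a string (tokens approximated by words).
    Each new chunk sees only the original query plus a short carryover,
    never the full reasoning history, so memory stays constant.
    """
    carryover = ""
    trace = []
    for _ in range(max_chunks):
        # Reset the context: original query plus the textual Markovian state.
        prompt = query if not carryover else f"{query}\n[carryover]\n{carryover}"
        chunk = generate(prompt, max_tokens=chunk_size)
        trace.append(chunk)
        if stop_marker in chunk:  # model finished its answer
            break
        # Carry forward only the tail of this chunk to the next one.
        carryover = " ".join(chunk.split()[-carry_tokens:])
    return "".join(trace)
```

In the paper's framing, RL training is what teaches the model to pack task-critical state into that carryover; the wrapper itself corresponds to what Kazemnejad calls delethink-tracing at inference time.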

Delethink in action

To test their approach, the researchers trained R1-Distill-1.5B with Delethink on a dataset of competition-level math problems, then evaluated it against several benchmarks. The model was trained to reason for up to 24,000 tokens but with fixed 8,000-token chunks.

The researchers compared this to models trained with the standard LongCoT-RL method. Their findings indicate that the model trained with Delethink could reason up to 24,000 tokens, and matched or surpassed a LongCoT model trained with the same 24,000-token budget on math benchmarks. On other tasks like coding and PhD-level questions, Delethink also matched or slightly beat its LongCoT counterpart. “Overall, these results indicate that Delethink uses its thinking tokens as effectively as LongCoT-RL with reduced compute,” the researchers write.

The benefits become even more pronounced when scaling beyond the training budget. While models trained with LongCoT quickly plateaued at their training limits, the Delethink-trained model continued to improve its performance. For instance, some math problems were only solved after the model reasoned for up to 140,000 tokens, far beyond its 24,000-token training budget. This linear compute advantage is substantial for enterprise applications. The researchers estimate that training a model to an average thinking length of 96,000 tokens would require 27 H100-GPU-months with LongCoT, versus just 7 with Delethink.

This efficiency extends directly to inference, the primary operational cost for most enterprises. "Models trained in Markovian Thinking use the same inference style (delethink-tracing) during test time, which provides the same advantages of linear compute and constant memory after training," said Kazemnejad. He offered a practical example: An AI agent could "debug a large codebase and think for a long time... which of course reduces the cost significantly compared to the conventional LongCoT approach."

Interestingly, the researchers found that off-the-shelf reasoning models, even without any specific training, already exhibit some ability to think in a Markovian way. This finding has immediate practical implications for developers. "In practice, this means that, without Delethink-RL, these models can already run a delethink-tracing wrapper and perform competitively with LongCoT on our benchmarked tasks," Kazemnejad said.

Their experiments with larger models such as GPT-OSS 120B showed robust performance with Delethink across a range of complex tasks. This latent ability provides a strong starting point for RL training, helping explain why the method is so effective. “Together, these results suggest that Delethink is compatible and scales with state-of-the-art models,” the researchers conclude.

The success of Markovian Thinking shows it may be possible for "next-generation reasoning models to think for millions of tokens," the researchers note. This opens the door to fundamentally new AI capabilities, moving beyond current constraints.

"Markovian Thinking... opens the path for models that can 'think' for very long horizons, which we view as a necessary step toward eventual scientific discovery," Kazemnejad said. "Our approach removes a key bottleneck and can allow training for much longer horizon tasks, which enables next-gen capabilities."
