MarkTechPost@AI · September 9
ParaThinker: Breaking the LLM Reasoning Bottleneck with Parallel Thinking

Large language models (LLMs) hit a bottleneck in test-time compute scaling that stems from their inherently sequential reasoning: early errors propagate along the chain and cap further gains, the so-called "tunnel vision" effect. To address this, a research team at Tsinghua University proposes ParaThinker, a framework that trains an LLM to generate multiple distinct reasoning paths in parallel and synthesize them into a better final answer. ParaThinker parallelizes thinking through specialized control tokens, path-specific positional embeddings, and a two-phase attention mask, and reuses KV caches to keep inference efficient. Experiments show that ParaThinker outperforms conventional sequential reasoning in both accuracy and efficiency, offering a new direction for scaling LLMs.

🧠 **The "tunnel vision" bottleneck of sequential reasoning**: Current test-time compute scaling relies mainly on a single reasoning path, and accuracy gains slow down markedly once the token budget passes a certain point (e.g., beyond 32K). Early errors propagate along the reasoning chain and are hard for the model to correct, the "tunnel vision" effect, indicating that the problem lies in the method rather than in model capacity.

💡 **What ParaThinker contributes**: ParaThinker is an end-to-end framework that overcomes the limits of sequential reasoning by generating multiple parallel, diverse reasoning paths and fusing them into a final answer. Its core components are specialized control tokens (`<think i>`) that launch independent reasoning paths, thought-specific positional embeddings that distinguish the paths, and a two-phase attention mask that enforces path independence during reasoning and controlled integration during answer generation.

🚀 **Efficient training and inference**: ParaThinker is trained on multi-path reasoning datasets built by sampling multiple solution paths from teacher models. At inference, KV caches from the reasoning stage are reused to avoid redundant prefilling, which significantly improves efficiency. Experiments show that the ParaThinker-1.5B model far outperforms sequential models under the same compute budget, with only a small latency overhead from parallel reasoning.

📊 **Results and advantages**: On mathematical reasoning benchmarks such as AIME, AMC, and MATH, ParaThinker raises accuracy substantially; for example, ParaThinker-1.5B improves on the sequential baseline by 12.3%. The latency overhead of parallel reasoning is only 7.1%, and generating more paths remains efficient. Unlike majority voting, self-consistency, and other methods that require external verifiers, ParaThinker performs parallel reasoning inside the model itself, which scales better.

Why Do Sequential LLMs Hit a Bottleneck?

Test-time compute scaling in LLMs has traditionally relied on extending a single reasoning path. While this improves reasoning up to a point, performance quickly plateaus. Experiments on DeepSeek-R1-distill-Qwen-1.5B show that increasing token budgets beyond 32K (up to 128K) yields negligible accuracy gains. The bottleneck arises from early token commitment, where initial errors propagate through the entire chain of thought. This effect, referred to as Tunnel Vision, indicates that the scaling issue is methodological rather than a fundamental limit of model capacity.

What Is Tunnel Vision and How Is It Diagnosed?

Researchers quantified recovery ability by forcing models to continue from erroneous prefixes of varying lengths (100–1600 tokens). Accuracy declined monotonically as prefix length increased, demonstrating that once committed to a flawed trajectory, the model cannot recover—even when given additional computation budget. This confirms that sequential scaling allocates compute inefficiently.
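This probe is straightforward to reproduce with any causal LM: truncate an incorrect sampled solution to a fixed prefix length, force the model to continue from it, and track accuracy against prefix length. A minimal sketch using Hugging Face transformers follows; the checkpoint id, prompt template, and grading are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of the Tunnel Vision probe: force the model to keep reasoning after a
# flawed partial chain-of-thought and measure accuracy vs. prefix length.
# Checkpoint id, prompt template, and grading are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def continue_from_prefix(question: str, bad_prefix: str, max_new_tokens: int = 2048) -> str:
    """Continue generation from an erroneous reasoning prefix of a chosen length."""
    prompt = f"Question: {question}\nLet's think step by step.\n{bad_prefix}"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# accuracy(prefix_len) is then averaged over a benchmark, with incorrect sampled
# solutions truncated to 100-1600 tokens to serve as the flawed prefixes.
```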

Paper: https://arxiv.org/abs/2509.04475

How Does ParaThinker Introduce Parallel Thinking?

A team of researchers from Tsinghua University introduces ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer, operationalizing native thought parallelism within a single model.

Key architectural components include:

- Specialized control tokens (`<think i>`) that launch independent reasoning paths.
- Thought-specific positional embeddings that disambiguate tokens belonging to different paths.
- A two-phase attention mask that enforces path independence during reasoning and controlled integration during answer summarization.

A critical efficiency gain comes from reusing KV-caches from the reasoning stage in the summarization phase, eliminating redundant re-prefilling; the sketch below illustrates this two-phase attention layout.
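The two-phase masking can be pictured as a single attention pattern over the concatenated sequence [prompt | path_1 | ... | path_P | summary]. Below is a minimal sketch, assuming a boolean mask convention (True = may attend); it is an illustration of the described scheme, not the authors' implementation.

```python
import torch

def parathinker_style_mask(prompt_len: int, path_lens: list, summary_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) over the token layout
    [prompt | path_1 | ... | path_P | summary].

    Phase 1 (reasoning): each path attends causally to the shared prompt and to
    itself only, enforcing path independence.
    Phase 2 (summarization): summary tokens attend causally to the prompt, to
    every path's cached tokens, and to earlier summary tokens.
    """
    total = prompt_len + sum(path_lens) + summary_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Prompt: ordinary causal self-attention.
    mask[:prompt_len, :prompt_len] = torch.tril(torch.ones(prompt_len, prompt_len, dtype=torch.bool))

    # Each reasoning path: full view of the prompt, causal over itself, blind to other paths.
    offset = prompt_len
    for n in path_lens:
        mask[offset:offset + n, :prompt_len] = True
        mask[offset:offset + n, offset:offset + n] = torch.tril(torch.ones(n, n, dtype=torch.bool))
        offset += n

    # Summary: full view of the prompt and all paths (their KV caches are reused), causal over itself.
    mask[offset:, :offset] = True
    mask[offset:, offset:] = torch.tril(torch.ones(summary_len, summary_len, dtype=torch.bool))
    return mask

# Example: 8-token prompt, two 5-token paths, 4 summary tokens -> a 22x22 mask.
print(parathinker_style_mask(8, [5, 5], 4).shape)
```

Because the summary rows can see every path block, the per-path KV caches computed during the reasoning phase can simply be kept and attended to during summarization, which is where the re-prefilling savings come from.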


How Is ParaThinker Trained for Parallel Reasoning?

Supervised fine-tuning (SFT) was conducted using multi-path reasoning datasets. Training data was constructed by sampling multiple solution paths from teacher models (DeepSeek-R1, GPT-OSS-20B). Each example included several <think i> trajectories and a final <summary> solution. Randomized token sampling ensured generalization to more paths at inference than seen in training.

The fine-tuning used Qwen-2.5 models (1.5B and 7B parameters) with a maximum context length of 28K tokens. Data sources included Open-R1, DeepMath, s1k, and LIMO, supplemented with additional solutions sampled at temperature 0.8. Training was run on multiple A800 GPUs.
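A hedged sketch of how one multi-path SFT example might be serialized under the `<think i>` / `<summary>` convention described above; the exact token strings, path count, and sampling policy are assumptions for illustration.

```python
import random

def build_multipath_example(question: str, teacher_solutions: list, final_answer: str,
                            num_paths: int = 4, seed: int = 0) -> str:
    """Serialize one SFT example: several independently sampled teacher trajectories
    wrapped in <think i> blocks, followed by a <summary> block with the final solution.
    Token strings and formatting are illustrative, not the dataset's exact schema."""
    rng = random.Random(seed)
    paths = rng.sample(teacher_solutions, k=min(num_paths, len(teacher_solutions)))
    parts = [question]
    for i, path in enumerate(paths, start=1):
        parts.append(f"<think {i}>\n{path}\n</think {i}>")
    parts.append(f"<summary>\n{final_answer}\n</summary>")
    return "\n".join(parts)

# Varying which path slots are populated across training examples is what allows the
# model to generalize to more parallel paths at inference than it saw during SFT.
```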


What Are the Experimental Results?

Evaluation on AIME 2024, AIME 2025, AMC 2023, and MATH-500 yields the following:

- ParaThinker-1.5B improves accuracy by 12.3% over the sequential baseline under the same compute budget.
- The latency overhead of parallel path generation is only 7.1%, and producing additional paths remains efficient thanks to KV-cache reuse.

What Do Ablation Studies Indicate?

How Does ParaThinker Compare to Other Methods?

Conventional parallel strategies such as majority voting, self-consistency, and Tree of Thoughts require external verifiers or post-hoc selection, limiting scalability. Diffusion-based token-parallel methods perform poorly on reasoning tasks due to sequential dependency. Architectural approaches like PARSCALE demand structural changes and pretraining. In contrast, ParaThinker preserves the Transformer backbone and introduces parallelism at the reasoning stage, integrating multiple KV-caches into a unified summarization step.
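For contrast, a self-consistency style baseline samples several chains independently and picks the most frequent final answer post hoc, with no interaction between paths inside the model; a minimal sketch (answer extraction omitted):

```python
from collections import Counter

def majority_vote(final_answers: list) -> str:
    """Self-consistency baseline: independently sampled chains, most frequent answer wins.
    Selection happens after generation, outside the model."""
    return Counter(final_answers).most_common(1)[0][0]

print(majority_vote(["42", "41", "42", "42"]))  # -> "42"
```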

Summary

ParaThinker demonstrates that test-time scaling bottlenecks are an artifact of sequential reasoning strategies. By allocating compute across width (parallel trajectories) rather than depth (longer chains), smaller models can outperform significantly larger baselines with minimal latency overhead. This establishes native thought parallelism as a critical dimension for future LLM scaling.


Check out the paper: https://arxiv.org/abs/2509.04475

