philschmid RSS feed, September 30
DeepSeek-R1 Model Released, Improving Complex Reasoning Capabilities

DeepSeek AI released DeepSeek-R1, an open model that rivals OpenAI's o1 on complex reasoning tasks. The model is trained with the Group Relative Policy Optimization (GRPO) algorithm and an RL-focused multi-stage training approach. GRPO removes the need for a value function model and instead estimates the baseline from group scores, reducing memory usage and computational overhead. DeepSeek R1 goes through a four-stage training process of supervised fine-tuning, reinforcement learning for reasoning, rejection sampling, and a further fine-tuning round, which improves the model's reasoning ability in domains such as mathematics and coding. A language-consistency reward is also added to improve output quality.

🔍 By adopting the Group Relative Policy Optimization (GRPO) algorithm, DeepSeek R1 reaches performance on complex reasoning tasks comparable to OpenAI's o1. GRPO removes the need for a value function model and estimates the baseline from group scores, which lowers memory usage and computational overhead.

📈 DeepSeek R1's training proceeds in four stages: supervised fine-tuning, reinforcement learning for reasoning, rejection sampling, and a further fine-tuning round. This multi-stage approach significantly improves the model's reasoning ability in areas such as mathematics and coding.

🗣️ During training, DeepSeek R1 introduces a language-consistency reward to improve output quality. This effectively reduces language mixing and poor readability in the output, so the model keeps its reasoning ability while producing more fluent and consistent text.

🔗 DeepSeek R1's training data includes a large amount of human-annotated chain-of-thought (CoT) data, which is used in the supervised fine-tuning stage to improve the model's readability and coherence.

🚀 In DeepSeek R1's training, supervised fine-tuning is performed before reinforcement learning, which not only speeds up training but also makes it more stable. In addition, rule-based reward models focused on accuracy and format work better than complex reward models.

DeepSeek AI released DeepSeek-R1, an open model that rivals OpenAI's o1 in complex reasoning tasks, introduced together with an RL-focused multi-stage training approach built around Group Relative Policy Optimization (GRPO).

Understanding Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed to improve the reasoning capabilities of LLMs. It was introduced in the DeepSeekMath paper in the context of mathematical reasoning. GRPO modifies traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates baselines from group scores, reducing memory usage and computational overhead. GRPO, now also used by the Qwen team, can be used with rule-based or binary rewards as well as general reward models to improve models on helpfulness.

1. Sampling: Generate multiple outputs for each prompt using the current policy.
2. Reward Scoring: Each generation is scored using a reward function, which can be rule-based or outcome-based.
3. Advantage Calculation: The average reward of the generated outputs is used as a baseline. The advantage of each solution within the group is then computed relative to this baseline, and the reward is normalized within the group (see the sketch below).
4. Policy Optimization: The policy tries to maximize the GRPO objective, which includes the calculated advantages and a KL divergence term. This is different from how PPO implements the KL term within the reward.
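To make the group-based baseline concrete, here is a minimal sketch of the advantage calculation in PyTorch. It assumes rewards have already been collected into a `(num_prompts, group_size)` tensor; the function name and shapes are illustrative, not taken from DeepSeek's code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: normalize each sampled output's reward against the
    mean and standard deviation of its own group, instead of using a learned
    value function as the baseline.

    rewards: shape (num_prompts, group_size), one row per prompt holding the
             rewards of all outputs sampled for that prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # small epsilon avoids division by zero

# Example: 2 prompts, 4 sampled outputs each, rule-based 0/1 rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```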

The key differences from Proximal Policy Optimization (PPO) are:

- No Value Function: Unlike PPO, GRPO does not rely on a separate value function model, which simplifies training and reduces memory consumption.
- Group-Based Advantage: GRPO uses the average reward of a group of outputs as a baseline. This approach better aligns with the nature of reward model training, which often examines multiple outputs for a single input.
- KL Divergence: GRPO incorporates the KL divergence term directly into the loss function, while PPO often uses it as part of the reward signal (see the loss sketch below).
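The KL placement can also be sketched in code. The snippet below is a simplified GRPO-style loss, assuming per-token log-probabilities from the current policy, the old (sampling) policy, and a frozen reference model, with the group-relative advantage broadcast to every token; the hyperparameter names and values (`clip_eps`, `kl_coef`) are illustrative.

```python
import torch

def grpo_loss(logprobs: torch.Tensor,      # current policy log-probs, (batch, seq)
              old_logprobs: torch.Tensor,  # log-probs of the policy that sampled the outputs
              ref_logprobs: torch.Tensor,  # frozen reference model log-probs
              advantages: torch.Tensor,    # group-relative advantages, (batch, 1)
              clip_eps: float = 0.2,
              kl_coef: float = 0.04) -> torch.Tensor:
    # PPO-style clipped policy term, but driven by the group-relative advantage
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = torch.min(ratio * advantages, clipped * advantages)

    # KL(pi || pi_ref) estimate added directly to the loss (not folded into the reward)
    kl = torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1.0

    # Maximize the objective = minimize its negative
    return -(policy_term - kl_coef * kl).mean()
```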

Exhibit: Pure Reinforcement Learning (R1-zero)

In building DeepSeek R1, the team gained deep insights from experimenting with reinforcement learning on their base model. Starting with DeepSeek V3, they applied GRPO to unsupervised reasoning text completions with rule-based reward models that focused on format, mathematics, and coding:

- Accuracy rewards: Evaluate whether the response is correct, e.g. the final result of a math problem or a LeetCode solution that compiles and passes its tests.
- Format rewards: Enforce the expected output format by rewarding the model for putting its thinking process between <think> and </think> tags (a sketch of such checks follows below).

[Image: prompt.png, https://www.philschmid.de/static/blog/deepseek-r1/prompt.png]
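As a rough picture of what such rule-based rewards can look like, here is a small sketch. The exact tag layout and checking logic used by DeepSeek are simplified; the regex pattern, the <answer> block, and the exact-match comparison are illustrative assumptions (real checks may use math verifiers or unit tests).

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think>...</think> followed
    by an <answer>...</answer> block, else 0.0 (illustrative pattern)."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the text inside <answer>...</answer> exactly matches the known
    result, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0
```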

This led to the pass@1 score on AIME 2024 increasing from 15.6% to 71.0%, reaching performance levels comparable to OpenAI-o1-0912. Output token length per problem also increased, indicating the model naturally learns to solve tasks with more test-time compute (longer reasoning traces).

[Image: r1-zero.png, https://www.philschmid.de/static/blog/deepseek-r1/r1-zero.png]

This has the drawback of poor readability and language mixing, but it was solved for R1 using a multi-stage approach with alternating SFT → RL steps.

The Multi-Stage Training of DeepSeek R1

To avoid the unstable cold-start phase of reinforcement learning (RL) training from the base model, the team started with supervised fine-tuning.

Stage 1/4 Base to Supervised Fine-Tuning (SFT)

Collected up to 10k token-long chain-of-thought (CoT) samples using fine-tuned models, R1-Zero, and human annotators. The data was used to fine-tune the DeepSeek V3 base model to improve readability and coherence.
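A minimal sketch of this SFT stage, assuming a Hugging Face-style causal language model, tokenizer, and optimizer and a single (prompt, long-CoT response) pair; the helper name and training-loop details are illustrative, not DeepSeek's pipeline.

```python
def sft_step(model, tokenizer, optimizer, prompt: str, cot_response: str) -> float:
    """One supervised fine-tuning step: standard next-token prediction on the
    concatenated prompt + long chain-of-thought response."""
    text = prompt + cot_response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Causal LM loss; the model shifts labels internally for next-token prediction
    outputs = model(input_ids=batch.input_ids,
                    attention_mask=batch.attention_mask,
                    labels=batch.input_ids)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```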

Stage 2/4 RL for Reasoning

Used the same RL pipeline as R1-Zero, focusing on reasoning-intensive tasks such as coding and math with the same rule-based reward models. This time, an additional reward for "language consistency" was added to help the model stick to a single language.
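The language-consistency reward can be illustrated with a simple proxy: the proportion of words in the chain of thought that appear to be in the target language. The ASCII check below is a crude stand-in for real language identification and is purely an assumption for illustration.

```python
def language_consistency_reward(completion: str) -> float:
    """Fraction of whitespace-separated words made up of ASCII characters,
    used here as a rough proxy for 'stays in English' (illustrative only)."""
    words = completion.split()
    if not words:
        return 0.0
    target_language = sum(1 for w in words if all(ord(c) < 128 for c in w))
    return target_language / len(words)
```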

Stage 3/4 Rejection Sampling and SFT

Generated a large synthetic dataset using rejection sampling (RS). The model from Stage 2 was used with DeepSeek V3 as a judge to generate 600k reasoning-related samples, plus 200k samples for writing, role-playing, and other general-purpose tasks, reusing portions of the SFT dataset of DeepSeek-V3 or regenerating them with CoT included.
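A sketch of the rejection-sampling idea, assuming hypothetical `generate` and `judge` callables (e.g. the Stage 2 model for sampling and DeepSeek V3 acting as a judge); the candidate count and acceptance threshold are illustrative.

```python
from typing import Callable, List

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     judge: Callable[[str, str], float],
                     num_candidates: int = 16,
                     threshold: float = 0.5) -> List[str]:
    """Sample several candidate completions and keep only those the judge
    scores above the threshold; the survivors become synthetic SFT data."""
    accepted = []
    for _ in range(num_candidates):
        candidate = generate(prompt)
        if judge(prompt, candidate) >= threshold:
            accepted.append(candidate)
    return accepted
```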

Stage 4/4 RL for Helpfulness

In the final stage, GRPO is used again with a combination of rule-based and outcome reward models to improve the model's helpfulness and harmlessness, leading to the final DeepSeek-R1 model.
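How rule-based and learned rewards might be combined in this stage can be sketched as a simple sum; `rule_checks` and `outcome_rm` below are hypothetical callables standing in for the format/accuracy rules and a learned outcome reward model, and the unweighted sum is an assumption for illustration.

```python
from typing import Callable, Sequence

def combined_reward(prompt: str,
                    completion: str,
                    rule_checks: Sequence[Callable[[str], float]],
                    outcome_rm: Callable[[str, str], float]) -> float:
    """Sum rule-based scores (verifiable tasks) with a learned outcome reward
    (open-ended helpfulness/harmlessness)."""
    score = sum(check(completion) for check in rule_checks)
    score += outcome_rm(prompt, completion)
    return score
```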

Surprises

- DeepSeek didn't use Monte Carlo Tree Search (MCTS) or Process Reward Models (PRM).
- Fine-tuning before applying GRPO can actually make the training process faster and more stable.
- Rule-based rewards focused on accuracy and format are more effective than complex reward models.
