DeepSeekMath Summary

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

High-Level Summary

The main theoretical contribution is the introduction of GRPO, which extends PPO.

Evolution: PPO to GRPO

Proximal Policy Optimisation (PPO) is an actor-critic RL algorithm which maximises a surrogate objective:

\theta_{k+1} = \arg\max_\theta \mathbb E_{q,\, o \sim \pi_{\theta_k}}\bigl[ J_\textsf{PPO}(q, o, \theta_k, \theta) \bigr]

with

J_\textsf{PPO}(q, o, \theta', \theta) = \frac1{|o|} \sum_{t=1}^{|o|} \min\biggl\{ \frac{\pi_\theta(o_t \mid q, o_{< t})}{\pi_{\theta'}(o_t \mid q, o_{< t})} A_t,\ (1 + \textup{sgn}(A_t)\, \varepsilon) A_t \biggr\} - \beta\, D_\textsf{KL}( \pi_\theta \mathrel{\|} \pi_\textsf{ref}).
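
As a concrete reading of this objective, here is a minimal PyTorch-style sketch of the per-token clipped surrogate. The names logp_new, logp_old and advantages are illustrative, assuming per-token log-probabilities and advantages have already been gathered; the standard clip form used here is equivalent to the (1 + sgn(A_t) ε) A_t term above once the min is taken. The KL penalty is left to the GRPO sketch further down.

```python
import torch

def ppo_token_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped per-token PPO surrogate for one output o, averaged over its
    |o| tokens (to be maximised); the KL penalty is handled separately."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_theta / pi_theta'
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()                   # (1/|o|) * sum_t min{...}
```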

The value function is treated as a baseline in estimating the advantage. In the LLM context, usually only the last token is assigned a reward score, which may complicate training a value function that is accurate at each token. Group Relative Policy Optimisation (GRPO) addresses this:

Figure: PPO vs GRPO.

More specifically, for each question q, GRPO samples a group of outputs \{o_1, ..., o_G\} from the old policy \pi_{\theta'} and maximises an analogous surrogate objective

\textstyle J_\textsf{GRPO}(q, \{o_1, ..., o_G\}, \theta', \theta) = \frac1G \sum_{i=1}^G J_\textsf{PPO}(q, o_i, \theta', \theta),

except that now the advantage A_t is replaced with the estimate \hat A_{i, t}, based on the rewards of the outputs inside each group only; in all its glory,

\textstyle J_\textsf{GRPO}(q, \{o_1, ..., o_G\}, \theta', \theta) = \frac1G \sum_{i=1}^G \frac1{|o_i|} \sum_{t=1}^{|o_i|} \min\bigl\{ \frac{\pi_\theta(o_{i, t} \mid q, o_{i, < t})}{\pi_{\theta'}(o_{i, t} \mid q, o_{i, < t})} \hat A_{i, t},\ (1 + \textup{sgn}(\hat A_{i, t}) \varepsilon) \hat A_{i, t} \bigr\} - \beta D_\textsf{KL}(\pi_\theta \mathrel{\|} \pi_\textsf{ref}).
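
A minimal sketch of how this group objective could be assembled, assuming the G outputs for a question are padded to a common length T with a 0/1 mask on real tokens. Here kl_est is a per-token estimate of D_KL(\pi_\theta \| \pi_\textsf{ref}) (see the estimator below), and eps and beta are illustrative defaults, not prescriptions from the paper.

```python
import torch

def grpo_objective(logp_new, logp_old, advantages, kl_est, mask, eps=0.2, beta=0.04):
    """GRPO surrogate for a single question q. All tensors have shape (G, T):
    G sampled outputs padded to T tokens, with mask = 1 on real tokens."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_theta / pi_theta'
    surr = torch.min(ratio * advantages,
                     torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    per_token = surr - beta * kl_est                              # clipped term minus KL penalty
    per_output = (per_token * mask).sum(dim=1) / mask.sum(dim=1)  # (1/|o_i|) * sum_t
    return per_output.mean()                                      # (1/G) * sum_i
```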

An unbiased estimator of D_\textsf{KL}(\pi_\theta \mathrel{\|} \pi_\textsf{ref}) is used, namely

D_\textsf{KL}(\pi_\theta \mathrel{\|} \pi_\textsf{ref}) \approx \frac{\pi_\textsf{ref}(o_{i, t} \mid q, o_{i, < t})}{\pi_\theta(o_{i, t} \mid q, o_{i, < t})} - \log \frac{\pi_\textsf{ref}(o_{i, t} \mid q, o_{i, < t})}{\pi_\theta(o_{i, t} \mid q, o_{i, < t})} - 1 \ge 0.
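
In code, given per-token log-probabilities under the current and reference policies (argument names assumed for illustration), the estimator is a one-liner. Since x - log x - 1 \ge 0 for all x > 0, the estimate is non-negative even though individual log-ratio samples can have either sign.

```python
import torch

def kl_estimate(logp_new, logp_ref):
    """Per-token estimator of KL(pi_theta || pi_ref): r - log(r) - 1 with
    r = pi_ref / pi_theta, evaluated at the sampled tokens. Always >= 0."""
    log_ratio = logp_ref - logp_new        # log(pi_ref / pi_theta)
    return torch.exp(log_ratio) - log_ratio - 1.0
```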

One of the key benefits of GRPO over PPO is that it avoids training a separate value (critic) model to estimate the advantage A_t, which can be costly. Two ways of computing \hat A_{i, t} are mentioned: "outcome supervision" in #4.1.2 and "process supervision" in #4.1.3. For example, with outcome supervision,

\hat A_{i,t} = \bigl( r_i - \textup{mean}(r) \bigr) / \textup{stddev}(r) \quad\text{for all}\quad t,

where r_i is the reward for o_i and r = (r_i)_{i=1}^G; in particular, the same advantage \hat A_{i, t} is assigned to every timestep t.
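
A sketch of this "outcome" assignment, assuming one scalar reward per output and outputs padded to T tokens; the small epsilon guarding against a zero standard deviation is an implementation detail, not from the paper.

```python
import torch

def outcome_advantages(rewards, T):
    """Group-relative 'outcome' advantages: normalise the G scalar rewards
    within the group, then broadcast the same value to every timestep."""
    r = torch.as_tensor(rewards, dtype=torch.float32)             # shape (G,)
    a = (r - r.mean()) / (r.std() + 1e-8)                         # guard against zero std
    return a.unsqueeze(1).expand(-1, T)                           # shape (G, T)
```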

Key Differences: PPO vs GRPO