DeepSeekMath Summary
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
High-Level Summary
- Introduce DeepSeekMath 7B, an LLM focused on mathematical capabilities
- Achieves performance comparable to Minerva 540B, despite having ~77x fewer parameters
- Introduces and uses Group Relative Policy Optimisation (GRPO): GRPO foregoes the critic model, instead estimating the baseline from group scores
- Provide a unified paradigm to understand different methods, and use it to explore the reasons behind the effectiveness of RL
The main theoretical contribution is the introduction of GRPO, which extends PPO.
Evolution: PPO to GRPO
Proximal Policy Optimisation (PPO) is an actor-critic RL algorithm which maximises a surrogate objective:
$$\theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{q,\, o \sim \pi_{\theta_k}}\big[ J_{PPO}(q, o, \theta_k, \theta) \big]$$
with
$$J_{PPO}(q, o, \theta', \theta) = \frac{1}{|o|} \sum_{t=1}^{|o|} \min\!\left\{ \frac{\pi_{\theta}(o_t \mid q, o_{<t})}{\pi_{\theta'}(o_t \mid q, o_{<t})}\, A_t,\; \big(1 + \operatorname{sgn}(A_t)\,\varepsilon\big) A_t \right\} - \beta\, D_{KL}\big(\pi_{\theta} \,\|\, \pi_{ref}\big).$$
- $\pi_{\theta}$ / $\pi_{\theta'}$ are the current/old policy models;
- $q$ / $o$ are questions/outputs sampled from the question dataset/old policy;
- $A_t$ is the advantage based on the rewards and a learned value function;
- $\varepsilon$ is a clipping hyperparameter for stabilising training;
- $\beta$ is a hyperparameter governing the per-token KL penalty.
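To make the pieces concrete, here is a minimal PyTorch sketch of the per-token clipped surrogate for a single output $o$. The function name, argument names, and the default values of `eps` and `beta` are illustrative assumptions rather than details from the paper, and the KL term uses a naive per-token log-ratio estimate.

```python
import torch

def ppo_objective(lp_new, lp_old, lp_ref, adv, eps=0.2, beta=0.04):
    """Sketch of J_PPO(q, o, theta', theta) for one sampled output o.

    lp_new / lp_old / lp_ref: log pi_theta / pi_theta' / pi_ref of each
    generated token o_t given (q, o_<t); adv holds A_t. All have shape (|o|,).
    """
    ratio = torch.exp(lp_new - lp_old)                  # pi_theta / pi_theta'
    # min{ratio * A_t, clip(ratio, 1-eps, 1+eps) * A_t}, which equals the
    # (1 + sgn(A_t) * eps) * A_t form used in the text.
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    kl = lp_new - lp_ref                                # naive per-token KL estimate
    return (surrogate - beta * kl).mean()               # 1/|o| * sum over tokens
```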
The value function is treated as a baseline in estimating the advantage. In the LLM context, usually only the last token is assigned a reward score, which may complicate training a value function that is accurate at each token. Group Relative Policy Optimisation (GRPO) addresses this:
- it removes the need for additional value-function approximation;
- instead, it uses the average reward of multiple outputs sampled for the same question as the baseline.

More specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, \dots, o_G\}$ from the old policy $\pi_{\theta'}$ and maximises an analogous surrogate objective
$$J_{GRPO}(q, \{o_1, \dots, o_G\}, \theta', \theta) = \frac{1}{G} \sum_{i=1}^{G} J_{PPO}(q, o_i, \theta', \theta),$$
except that now the advantage $A_t$ is replaced with the estimate $\hat{A}_{i,t}$, based only on the rewards of the outputs inside each group; in all its glory,
$$J_{GRPO}(q, \{o_1, \dots, o_G\}, \theta', \theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left\{ \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta'}(o_{i,t} \mid q, o_{i,<t})}\, \hat{A}_{i,t},\; \big(1 + \operatorname{sgn}(\hat{A}_{i,t})\,\varepsilon\big) \hat{A}_{i,t} \right\} - \beta\, D_{KL}\big(\pi_{\theta} \,\|\, \pi_{ref}\big).$$
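A minimal sketch of how $J_{GRPO}$ could be assembled from per-token log-probabilities; the tensor layout and argument names are my assumptions, and the KL penalty uses the unbiased estimator given in the next paragraph.

```python
import torch

def grpo_objective(new_lp, old_lp, ref_lp, adv, eps=0.2, beta=0.04):
    """Sketch of J_GRPO for one question (not the paper's implementation).

    Each argument is a list of G per-token tensors, one per sampled output o_i:
    log pi_theta, log pi_theta', log pi_ref of each token, and \\hat{A}_{i,t}.
    """
    per_output = []
    for lp_new, lp_old, lp_ref, a in zip(new_lp, old_lp, ref_lp, adv):
        ratio = torch.exp(lp_new - lp_old)                      # pi_theta / pi_theta'
        surrogate = torch.minimum(ratio * a,
                                  torch.clamp(ratio, 1 - eps, 1 + eps) * a)
        # Per-token KL(pi_theta || pi_ref) via the unbiased estimator below.
        x = lp_ref - lp_new                                     # log(pi_ref / pi_theta)
        kl = torch.exp(x) - x - 1
        per_output.append((surrogate - beta * kl).mean())       # 1/|o_i| * sum_t
    return torch.stack(per_output).mean()                       # 1/G * sum_i
```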
An unbiased estimator of $D_{KL}(\pi_{\theta} \,\|\, \pi_{ref})$ is used, namely
$$D_{KL}\big(\pi_{\theta} \,\|\, \pi_{ref}\big) \approx \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} - 1 \;\ge\; 0.$$
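A quick numerical check of this claim (my own toy example, not from the paper): averaging the per-sample estimate over draws from $\pi_{\theta}$ should approximate the exact KL, and each term is non-negative because $e^x \ge 1 + x$.

```python
import torch

torch.manual_seed(0)
pi_theta = torch.tensor([0.7, 0.2, 0.1])   # toy next-token distribution under pi_theta
pi_ref = torch.tensor([0.5, 0.3, 0.2])     # toy reference distribution

exact_kl = (pi_theta * (pi_theta / pi_ref).log()).sum()

samples = torch.multinomial(pi_theta, num_samples=20000, replacement=True)
x = (pi_ref[samples] / pi_theta[samples]).log()    # log(pi_ref / pi_theta) per sample
estimates = torch.exp(x) - x - 1                   # each term is >= 0
print(float(exact_kl), float(estimates.mean()))    # the two values should be close
```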
One of the key benefits of GRPO over PPO is that it removes the need to learn and evaluate a separate value function when estimating the advantage, which can be costly. Two options for computing $\hat{A}_{i,t}$ are mentioned: "outcome" supervision in #4.1.2 and "process" supervision in #4.1.3. For example, under outcome supervision,
$$\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)} \quad \text{for all } t,$$
where $r_i$ is the reward for $o_i$ and $r = (r_i)_{i=1}^{G}$; in particular, the same advantage is prescribed to every timestep $t$.
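As a sketch (the variable names are mine, and the small epsilon in the denominator is only a numerical guard I have added), the outcome-supervision advantages could be computed like this:

```python
import torch

def outcome_advantages(rewards, lengths):
    """Group-relative 'outcome' advantages for G sampled outputs.

    rewards: shape-(G,) tensor with one scalar reward r_i per output o_i.
    lengths: list of |o_i|; the same normalised value is broadcast to every
    token of o_i, since only the final outcome is scored.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # guard against std == 0
    return [adv[i].expand(length) for i, length in enumerate(lengths)]
```

These per-token values are exactly the $\hat{A}_{i,t}$ fed into the $J_{GRPO}$ sketch above.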
Key Differences: PPO vs GRPO