Given a question \(q\), an LLM produces an output \(o\). This output may consist of multiple steps (\(|o| > 1\)), of which the last is the actual answer; or, it may simply be the answer (\(|o| = 1\)). We consider the set-up of reinforcement learning with verifiable rewards (RLVR); this includes mathematical questions, or coding questions for which unit tests are provided. Contrast this with classical control tasks, such as balancing a pole on a cart, where long state–action trajectories are involved. In the LLM framework, a policy \(\pi = \pi_\theta\) is indexed by the weights \(\theta\) of the LLM.
The high-level objective is to maximise the expected reward:
\[
\text{maximise} \quad J(\theta) := \mathbb{E}_{q \sim \mu,\, o \sim \pi_\theta(\cdot \mid q)}[R(q, o)] \quad \text{over weights } \theta,
\]
where μ is some fixed distribution over the questions. The reward R(q,o) may be binary: 1 if the answer is correct; 0 if it is wrong. But, typically, it is more nuanced, and will depend on the full output o—including the reasoning trace.
Policy Optimisation
This overview is intended to cover the most important details. Further (less important) details can be found in my RL deep-dive; the framework there is general state–action, not question–answer. My (older) DeepSeekMath summary introduces GRPO in the QA framework.
I found some lectures/courses useful.
Sergey Levine's Deep RL course is clear and in-depth; see, particularly, Lectures 5 and 9 on (Advanced) Policy Gradients.
Pieter Abbeel's Foundations of Deep RL provides a decent overview, but the details are often fuzzy; see, particularly, Lectures 3 and 4 on Policy Gradients and TRPO & PPO.
John Schulman covers policy gradients, TRPO and PPO in a single Deep RL Bootcamp lecture—but somewhat runs out of time when covering clipping.
Vanilla Policy Gradients
In order to do a gradient step, we need to determine
\[
J'(\theta)
= \mathbb{E}_{q \sim \mu}\Big[ \sum_o R(q, o)\, \nabla_\theta \pi_\theta(o \mid q) \Big]
= \mathbb{E}_{q \sim \mu}\Big[ \sum_o R(q, o)\, \pi_\theta(o \mid q)\, \nabla_\theta \log \pi_\theta(o \mid q) \Big]
= \mathbb{E}_{q \sim \mu,\, o \sim \pi_\theta(\cdot \mid q)}\big[ R(q, o)\, \nabla_\theta \log \pi_\theta(o \mid q) \big],
\]
using the identity
\[
\nabla_\theta \log \pi_\theta(o \mid q) = \nabla_\theta \pi_\theta(o \mid q) / \pi_\theta(o \mid q).
\]
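As a sanity check, the log-derivative identity can be verified numerically for a toy softmax policy over a handful of outputs (everything here is illustrative; no LLM involved):

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(logits, o):
    """Analytic gradient of log pi(o) wrt the logits: e_o - pi."""
    p = softmax(logits)
    return [(1.0 if i == o else 0.0) - p[i] for i in range(len(logits))]

# Finite-difference check of grad(log pi) = grad(pi) / pi, coordinate by coordinate.
logits = [0.5, -1.2, 2.0]
o, eps = 1, 1e-6
for k in range(3):
    bumped = list(logits)
    bumped[k] += eps
    fd = (softmax(bumped)[o] - softmax(logits)[o]) / eps  # approx grad_k pi(o)
    ratio = fd / softmax(logits)[o]                        # grad pi / pi
    assert abs(ratio - grad_log_pi(logits, o)[k]) < 1e-4
```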
An alternative representation of the objective uses importance sampling (in essence, just change of measure):
\[
J(\theta) = \mathbb{E}_{q \sim \mu,\, o \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\Big[ \frac{\pi_\theta(o \mid q)}{\pi_{\theta_\text{old}}(o \mid q)}\, R(q, o) \Big].
\]
This removes the variable θ from the generation aspect: all generations come from the current policy πθold.
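The change of measure is exact, which can be checked by brute force on a small discrete output space (a sketch with made-up numbers):

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [x / s for x in e]

# Toy setting: 4 possible outputs with fixed rewards (illustrative numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
pi_new = softmax([0.3, -0.5, 1.1, 0.0])   # pi_theta
pi_old = softmax([0.0, 0.2, 0.8, -0.3])   # pi_theta_old

# Direct objective: E_{o ~ pi_theta}[R(o)].
direct = sum(p * r for p, r in zip(pi_new, rewards))

# Importance-sampled form: E_{o ~ pi_old}[(pi_theta / pi_old) * R(o)].
reweighted = sum(po * (pn / po) * r for pn, po, r in zip(pi_new, pi_old, rewards))

assert abs(direct - reweighted) < 1e-12
```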
To reduce variance, the reward is usually replaced with the advantage \(A^{\pi_{\theta_\text{old}}}(q, o)\) under the current policy:
\[
A^{\pi}(q, o) := R(q, o) - V^{\pi}(q)
\quad\text{where}\quad
V^{\pi}(q) := \mathbb{E}_{o \sim \pi(\cdot \mid q)}[R(q, o)].
\]
With this definition,
\[
J(\theta) - J(\theta_\text{old})
= \mathbb{E}_{q \sim \mu}\big[ \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}[R(q, o)] - \mathbb{E}_{o \sim \pi_{\theta_\text{old}}(\cdot \mid q)}[R(q, o)] \big]
= \mathbb{E}_{q \sim \mu}\big[ \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}\big[ R(q, o) - V^{\pi_{\theta_\text{old}}}(q) \big] \big]
= \mathbb{E}_{q \sim \mu,\, o \sim \pi_\theta(\cdot \mid q)}\big[ A^{\pi_{\theta_\text{old}}}(q, o) \big].
\]
Applying the importance sampling trick,
\[
\Delta(\theta, \theta_\text{old}) := J(\theta) - J(\theta_\text{old})
= \mathbb{E}_{q \sim \mu,\, o \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\Big[ \frac{\pi_\theta(o \mid q)}{\pi_{\theta_\text{old}}(o \mid q)}\, A^{\pi_{\theta_\text{old}}}(q, o) \Big].
\]
Given that the current policy is \(\pi_{\theta_\text{old}}\), maximising \(\theta \mapsto J(\theta)\) and maximising \(\theta \mapsto \Delta(\theta, \theta_\text{old})\) are equivalent.
Below, \(\Delta'\) indicates the derivative (\(\nabla_\theta\)) with respect to the first argument; in particular, \(\Delta'(\theta_\text{old}, \theta_\text{old}) = J'(\theta_\text{old})\).
TRPO: Trust Region Policy Optimisation
In the usual RL framework, an entire trajectory of states (and actions) is sampled from the proposed policy. In the state–action framework, sampling instead from the old policy—which is far preferable computationally, as the same rollouts can be used to compare multiple proposals—leads to a first-order approximation; this is detailed in my RL deep-dive or Sergey Levine's Lecture 9.1. To ensure the new and old policies are close, a KL constraint is included:
\[
\text{maximise} \quad \Delta(\theta, \theta_\text{old}) \quad \text{subject to} \quad \mathrm{KL}(\theta \,\|\, \theta_\text{old}) \le \varepsilon
\]
where
\[
\mathrm{KL}(\theta \,\|\, \theta_\text{old}) := \mathbb{E}_{q \sim \mu}\big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid q) \,\big\|\, \pi_{\theta_\text{old}}(\cdot \mid q) \big) \big].
\]
This complication doesn't actually arise in the LLM framework: there are no multi-step trajectories whose state distribution depends on the current policy, so the importance-sampled objective is exact. Nevertheless, KL regularisation can often be helpful.
Replacing the objective and constraint with leading-order approximations, a single natural-gradient ascent step solves the approximated problem exactly:
\[
\theta^\star = \theta_\text{old} + \alpha\, F(\theta_\text{old})^{-1} \Delta'(\theta_\text{old}, \theta_\text{old})
\]
where
\[
\alpha = \sqrt{ 2 \varepsilon \,\big/\, J'(\theta_\text{old})^\top F(\theta_\text{old})^{-1} J'(\theta_\text{old}) }
\quad\text{and}\quad
F(\theta) = \mathbb{E}_{q \sim \mu,\, o \sim \pi_\theta(\cdot \mid q)}\big[ (\nabla_\theta \log \pi_\theta(o \mid q)) (\nabla_\theta \log \pi_\theta(o \mid q))^\top \big]
\]
is the Fisher-information matrix. This is outlined in my RL deep-dive and detailed in Sergey Levine's Lecture 9.4.
PPO: Proximal Policy Optimisation
Calculating or estimating the Fisher matrix, or using the conjugate gradient method, is computationally expensive. PPO replaces the objective with
\[
J_\text{PPO}(\theta; \lambda) := J(\theta) - \lambda\, \mathrm{KL}(\theta \,\|\, \theta_\text{old}),
\]
where λ is a tunable hyperparameter. Again, the reward can be replaced with the advantage—the Δ version.
The method of Lagrange multipliers means there is some \(\lambda\) for which the optimal \(\theta\) for PPO is the same as for TRPO; finding this \(\lambda\) directly is infeasible, however. In practice, dual descent can be used, repeating the following steps until satisfied:
\[
\text{maximise } J_\text{PPO}(\theta; \lambda) \text{ over } \theta;
\qquad
\lambda \leftarrow \lambda + \alpha \big( \mathrm{KL}(\theta \,\|\, \theta_\text{old}) - \varepsilon \big).
\]
Or, more informally, maximise for a given penalty λ; if the resulting KL is too large/small, then increase/decrease the penalty λ.
More details are given in Sergey Levine's Lecture 9.3.
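A minimal sketch of dual descent, using 1-D quadratic stand-ins for \(J\) and the KL (purely illustrative; here the inner maximisation has a closed form, whereas in practice it is itself a gradient-based optimisation):

```python
# Toy 1-D dual descent: J(theta) = -(theta - 2)^2 and
# KL(theta || theta_old) = (theta - theta_old)^2 stand in for the real quantities.
theta_old, eps, lam, alpha = 0.0, 0.25, 1.0, 1.0

for _ in range(200):
    # Inner step: maximise J_PPO(theta; lam) = J(theta) - lam * KL, in closed form here.
    theta = (2.0 + lam * theta_old) / (1.0 + lam)
    kl = (theta - theta_old) ** 2
    # Dual step: increase the penalty if the KL is too large, decrease it if too small.
    lam = max(lam + alpha * (kl - eps), 1e-6)

# At convergence the constraint is active: KL(theta || theta_old) ~ eps.
assert abs(kl - eps) < 1e-6
```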
A popular extension replaces the original objective
\[
\Delta(\theta, \theta_\text{old})
= \mathbb{E}_{q \sim \mu,\, o \sim \pi_\theta(\cdot \mid q)}\big[ A^{\pi_{\theta_\text{old}}}(q, o) \big]
= \mathbb{E}_{q \sim \mu,\, o \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\Big[ \frac{\pi_\theta(o \mid q)}{\pi_{\theta_\text{old}}(o \mid q)}\, A^{\pi_{\theta_\text{old}}}(q, o) \Big]
\]
with
\[
\Delta_\text{PPO}^\text{clip}(\theta, \theta_\text{old}; \varepsilon)
= \mathbb{E}_{q \sim \mu,\, o \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\Big[ \text{min–clip}_\varepsilon^\varepsilon\Big( \frac{\pi_\theta(o \mid q)}{\pi_{\theta_\text{old}}(o \mid q)},\, A^{\pi_{\theta_\text{old}}}(q, o) \Big) \Big]
\]
where
\[
\text{min–clip}_{\varepsilon_\text{low}}^{\varepsilon_\text{high}}(r, a)
:= \min\big\{ r a,\, \mathrm{clip}(r,\, 1 - \varepsilon_\text{low},\, 1 + \varepsilon_\text{high})\, a \big\}.
\]
The clipping is a type of regularisation: it stops the probability ratio from changing too much, potentially over-reacting to the observed advantage. As such, the KL penalty is then sometimes removed. The default in many libraries is εhigh=εlow=ε=0.2.
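The clipped term itself is a few lines of code. A sketch, with the asymmetric thresholds written out so the behaviour for positive and negative advantages is explicit:

```python
def clip(r, lo, hi):
    return max(lo, min(r, hi))

def min_clip(r, a, eps_low=0.2, eps_high=0.2):
    """PPO's clipped surrogate term: min{r*a, clip(r, 1 - eps_low, 1 + eps_high)*a}."""
    return min(r * a, clip(r, 1.0 - eps_low, 1.0 + eps_high) * a)

# With a positive advantage, the ratio's upside is capped at 1 + eps_high ...
assert min_clip(1.5, 1.0) == 1.2
# ... but a ratio below 1 - eps_low is *not* clipped (the min picks r*a):
assert min_clip(0.5, 1.0) == 0.5
# With a negative advantage, the roles flip: large ratios pass through,
assert min_clip(1.5, -1.0) == -1.5
# while small ratios are floored at 1 - eps_low:
assert min_clip(0.5, -1.0) == -0.8
```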
Calculating the advantage precisely is the hard part, since it depends on the current policy. An estimator of \(A^{\pi_{\theta_\text{old}}}(q, o)\) is required. One simple option is Monte Carlo estimation: average the rewards of, say, 100 rollouts (which may themselves be long) from the given state–action. This is unbiased, but high variance and computationally heavy. To reduce variance (at the cost of some bias), an estimate \(V_\varphi\) of the value function is learned (typically, \(\varphi\) are the weights of a neural network) and used to bootstrap after a few steps, such as in GAE. See my RL deep-dive or Sergey Levine's Lecture 6.4 for more details.
GRPO: Group Relative Policy Optimisation
GRPO drops the critic model, using a Monte Carlo-style estimate instead. Here, "rollouts" are simply outputs (reasoning + answer) to a question.
However, instead of discarding the rollouts (all answering the same question) used to estimate the baseline, they are reused in the update step. Concretely, the objective is
\[
\Delta_\text{GRPO}(\theta, \theta_\text{old}; \varepsilon, \lambda, G)
:= \mathbb{E}_{q \sim \mu,\, o_1, \ldots, o_G \overset{\text{iid}}{\sim} \pi_{\theta_\text{old}}(\cdot \mid q)}
\bigg[
\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\text{min–clip}_\varepsilon^\varepsilon\Big( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})},\, A_{i,t} \Big)
- \lambda\, \mathrm{KL}(\theta \,\|\, \theta_\text{ref})
\bigg],
\]
where Ai,t is an estimate of the advantage that depends only on (q,o1,...,oG) and θref is the parameter vector of a reference model—typically, the pre-trained model, prior to RL fine-tuning. A canonical choice of estimator is
\[
A_{i,t} := A_i := \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}
\quad\text{where}\quad
r_i = R(q, o_i).
\]
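This estimator is simple to implement; a sketch (note it assumes the group's rewards are not all equal, otherwise the standard deviation vanishes):

```python
import math

def group_advantages(rewards):
    """Group-relative advantage: standardise the rewards within one group."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)  # assumed non-zero: the group must contain a mix of rewards
    return [(r - mean) / std for r in rewards]

# One group of G = 4 rollouts for the same question, binary rewards:
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
assert adv == [1.0, -1.0, -1.0, 1.0]
assert abs(sum(adv)) < 1e-12        # centred
assert adv[0] > 0 and adv[1] < 0    # correct answers get positive advantage
```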
Other penalties include formatting, language and length.
The rollouts are thus used both to estimate the mean reward for the question under the current policy and in the objective itself. Usually, the KL is estimated using the rollouts too: move \(\mathrm{KL}(\theta \,\|\, \theta_\text{ref})\) inside the average \(\frac{1}{G} \sum_{i=1}^G (\cdots)\) over the group and replace it with the (non-negative) unbiased estimator
\[
\mathrm{KL}_i := r_i^\text{ref} - 1 - \log r_i^\text{ref}
\quad\text{where}\quad
r_i^\text{ref} := \pi_{\theta_\text{ref}}(o_i \mid q) / \pi_\theta(o_i \mid q).
\]
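This is sometimes called the \(k_3\) estimator; in code (taking the log of the probability ratio as input, for numerical stability):

```python
import math

def kl_k3(log_ratio_ref):
    """Non-negative, unbiased KL estimator: r - 1 - log r, with r = pi_ref / pi_theta
    evaluated on a sample from pi_theta."""
    r = math.exp(log_ratio_ref)
    return r - 1.0 - math.log(r)

assert kl_k3(0.0) == 0.0    # identical policies contribute zero
assert kl_k3(0.5) > 0.0     # strictly positive otherwise,
assert kl_k3(-0.5) > 0.0    # whichever way the ratio deviates from 1
```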
A typical choice of G is between 4 and 32. If it is too large, then any unlikely but high-reward outputs will be swamped by the typical ones. If output-generation is not the bottleneck, an option would be to sample more to estimate the baseline (the mean and standard deviation above) but only use a selection (perhaps ones with atypical reward?) in the update.
Extensions to GRPO
The following algorithms are variants of, and adjustments to, GRPO, rather than new approaches.
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimisation
DAPO introduces five main adjustments to GRPO.
Clip Higher.
The upper threshold \(\varepsilon_\text{high}\) in the clipping \(\text{min–clip}_{\varepsilon_\text{low}}^{\varepsilon_\text{high}}\) is increased. This allows low-likelihood, high-reward tokens to be boosted significantly more, increasing the exploration of the policy.
Dynamic Sampling.
Batch gradients are diluted by groups with (little to) no signal. DAPO filters these out, sampling more groups until the batch is complete. An alternative to oversampling is to upweight the remainder.
Token-Level Policy Gradient Loss.
GRPO assigns each sample equal weight: \(\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} (\cdots)\). As such, tokens in longer responses are underweighted. DAPO uses equal token weight: \(\big( \sum_{i=1}^G |o_i| \big)^{-1} \sum_{i=1}^G \sum_{t=1}^{|o_i|} (\cdots)\).
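The difference between the two weightings can be seen on a toy pair of responses (numbers invented):

```python
def sample_level_mean(token_losses):
    """GRPO: average within each response, then across responses."""
    per_sample = [sum(toks) / len(toks) for toks in token_losses]
    return sum(per_sample) / len(per_sample)

def token_level_mean(token_losses):
    """DAPO: average over all tokens pooled together."""
    total = sum(sum(toks) for toks in token_losses)
    count = sum(len(toks) for toks in token_losses)
    return total / count

# Two responses: a short one (2 tokens) and a long one (8 tokens), toy losses.
losses = [[1.0, 1.0], [0.0] * 8]
assert sample_level_mean(losses) == 0.5   # each response weighted equally
assert token_level_mean(losses) == 0.2    # each token weighted equally
```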
Overlong Reward Shaping.
Truncating overlong responses could penalise sound reasoning. DAPO first filters overlong responses, masking their loss, then uses a soft punishment:
no penalty up to a threshold,
maximal penalty at a higher one
and
linear interpolation between.
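A sketch of such a soft punishment, with invented thresholds:

```python
def overlong_penalty(length, soft_max=16384, hard_max=20480):
    """Soft length punishment (thresholds illustrative): no penalty up to soft_max,
    maximal penalty (-1) at hard_max, linear interpolation between."""
    if length <= soft_max:
        return 0.0
    if length >= hard_max:
        return -1.0
    return -(length - soft_max) / (hard_max - soft_max)

assert overlong_penalty(1000) == 0.0
assert overlong_penalty(18432) == -0.5   # halfway between the thresholds
assert overlong_penalty(30000) == -1.0
```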
Remove KL Penalty.
The KL penalty regulates how far the online policy can go from the reference policy. However, in practice, it diverges significantly anyway, and the KL penalty may dominate the loss. DAPO removes the KL penalty—sets λ=0.
ProRL: Prolonged Reinforcement Learning
ProRL utilises clip-higher (with εhigh=0.4) and token-level loss from DAPO. Additionally, it (re)introduces two aspects.
KL Regularisation and Reference-Policy Reset.
Where DAPO removes the KL term, ProRL keeps it, but periodically hard-resets the reference weights \(\theta_\text{ref}\) to a recent snapshot of the online policy and reinitialises the optimiser state. This prevents the KL penalty from dominating, whilst retaining the benefits of KL regularisation.
Prolonged RL.
Typically, RL implementations run for no more than a few hundred steps. ProRL uses 2000+. The hypothesis is that this gives the model time to explore and uncover new strategies.
Magistral
Magistral uses all the extensions from DAPO (with εhigh≈0.25). Additionally, it introduces a new aspect.
Advantage Normalisation.
Under per-question standardisation, a correct answer to a hard question (for which the group's rewards have low standard deviation) gets upweighted significantly. There is a danger of fitting noise, particularly with a high \(\varepsilon_\text{high}\). Instead, Magistral normalises with respect to minibatches. Concretely, the advantage is first centred wrt the question: \(A_i := r_i - \mathrm{mean}(r_1, \ldots, r_G)\). Then it is normalised wrt the minibatch: \(A_i \leftarrow A_i / \mathrm{std}((A_j)_{j \in B})\), where \(B\) is a minibatch. However, their ablation study suggests this is of little benefit in their case.
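A sketch of the two-stage normalisation, with invented rewards and treating the minibatch as a list of groups:

```python
import math

def magistral_advantages(groups):
    """Centre rewards within each group, then normalise by the minibatch-wide std.
    `groups` is a list of reward lists, one per question (sketch of the idea)."""
    centred = []
    for rewards in groups:
        mean = sum(rewards) / len(rewards)
        centred.append([r - mean for r in rewards])
    flat = [a for g in centred for a in g]
    std = math.sqrt(sum(a * a for a in flat) / len(flat))  # assumed non-zero
    return [[a / std for a in g] for g in centred]

# Two groups in one minibatch: an easier question and a harder one.
adv = magistral_advantages([[1.0, 1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]])
# Each group is centred, but all share a single (minibatch-wide) scale:
assert all(abs(sum(g)) < 1e-12 for g in adv)
assert adv[0][2] < 0 and adv[1][0] > 0
```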