The high-level objective for policy optimisation is to maximise the expected reward over policies:
$$\text{maximise } \mathbb{E}_\pi\Big[\textstyle\sum_{t=0}^H R(s_t)\Big] \text{ over policies } \pi;$$
here, $(s_t, a_t)$ is the time-$t$ state–action pair, $R$ is the reward function and $H$ is some time horizon. Typically, the policies are parametrised by a vector $\theta$, and are stochastic policies: $\pi_\theta(a \mid s)$ is the probability of taking action $a$ when in state $s$.
This summary is focused on policy optimisation, and has four parts.
1. Vanilla policy gradient
2. Introduce trust regions: TRPO
3. Extend to proximal policy optimisation (v1 & v2)
4. Introduce group relative estimation: GRPO
Vanilla Policy Gradient
The 'vanilla' policy gradient algorithm uses a baseline $b$ to estimate the advantage. This could be a value function, perhaps one that is itself being learned.
for iteration = 1, 2, ...:
Collect a set of trajectories by executing the current policy
At each timestep t in each trajectory, compute
the return $R_t = \sum_{t' \ge t} \gamma^{t'-t} r(s_{t'}, a_{t'}, s_{t'+1})$
the advantage estimate $\hat A_t = R_t - b(s_t)$
Re-fit the baseline by minimising $(b(s_t) - R_t)^2$, summed over all trajectories and timesteps, eg via Monte Carlo estimates or bootstrapping
Update the policy using a policy gradient estimate, which is a sum of terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat A_t$
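As a concrete illustration, here is a minimal sketch of one such update in PyTorch for a small discrete-action problem; the network sizes, optimiser settings and the trajectory interface are assumptions for the example, not prescribed by the algorithm above.

```python
import torch
import torch.nn as nn

# Assumed toy set-up: a small policy network over discrete actions and a learned baseline.
obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
baseline = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
baseline_opt = torch.optim.Adam(baseline.parameters(), lr=1e-3)

def vanilla_pg_update(states, actions, rewards):
    """One 'vanilla' policy-gradient update from a single collected trajectory.

    states: (T, obs_dim) float tensor; actions: (T,) long tensor; rewards: (T,) float tensor.
    """
    T = rewards.shape[0]

    # Returns R_t = sum_{t' >= t} gamma^{t'-t} r_{t'}, computed backwards.
    returns = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running

    # Advantage estimate A_t = R_t - b(s_t); the baseline is treated as fixed here.
    advantages = returns - baseline(states).squeeze(-1).detach()

    # Policy gradient: sum of grad log pi(a_t | s_t) * A_t (maximise, so minimise the negative).
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantages).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # Re-fit the baseline by minimising (b(s_t) - R_t)^2.
    baseline_loss = ((baseline(states).squeeze(-1) - returns) ** 2).mean()
    baseline_opt.zero_grad(); baseline_loss.backward(); baseline_opt.step()
```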
TRPO: Trust Region Policy Optimisation

TRPO was introduced by Schulman et al. in a 2015 paper; see also Schulman's thesis. The 'trust region' constrains how far the new policy can move from the current one. In doing so, it enables some local approximations which make the optimisation easier.
Before proceeding formally, we set up some notation. For simplicity, we assume an infinite time horizon: $H = \infty$. We assume that the initial state $S_0 \sim p_0$ and that there is some transition kernel $p$ such that $S_{t+1} \sim p(\cdot \mid s, a)$ given $(S_t, A_t) = (s, a)$, independent of the policy.
Given a policy $\pi$, denote the state–action value function, value function and advantage by
$$Q^\pi(s, a) := \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^t r(S_t, A_t) \;\Big|\; S_0 = s,\, A_0 = a\Big], \quad V^\pi(s) := \mathbb{E}_{A \sim \pi(\cdot \mid s)}\big[Q^\pi(s, A)\big], \quad A^\pi(s, a) := Q^\pi(s, a) - V^\pi(s),$$
where $p(\cdot \mid s, a)$ denotes the one-step probabilities from state $s$ choosing action $a$; also write $R(\pi) := \mathbb{E}_{S_0 \sim p_0}[V^\pi(S_0)]$ for the expected return and $\rho_\pi$ for the (discounted) occupation measure of states under $\pi$. In particular, if $(S'_t, A'_t)_{t \ge 0} \sim \pi'$, then $S'_{t+1} \sim p(\cdot \mid S'_t, A'_t)$ given $(S'_t, A'_t)$. Hence,
$$R(\pi') = R(\pi) + \mathbb{E}_{S' \sim \rho_{\pi'},\, A' \sim \pi'(\cdot \mid S')}\big[A^\pi(S', A')\big];$$
in other words, sample a state $S' \sim \rho_{\pi'}$ according to the $\pi'$-occupation measure $\rho_{\pi'}$ and an action $A' \sim \pi'(\cdot \mid S')$ for this state according to $\pi'$, then take the expectation.
The sum $\sum_{a'} \pi'(a' \mid s') A^\pi(s', a')$ is a relatively simple function of $\pi'$. Eg, if $\pi' = (1 - \varepsilon)\pi + \varepsilon\nu$, then
$$\sum_{a'} \pi'(a' \mid s') A^\pi(s', a') = \varepsilon \sum_{a'} \nu(a' \mid s') A^\pi(s', a'),$$
since $\mathbb{E}_{A \sim \nu(\cdot \mid s)}[A^\nu(s, A)] = 0$ for any $\nu$—"the advantage of a measure over itself is 0". However, the dependence of $\rho_{\pi'}(s')$ on $\pi'$ is far more complex.
We plan to do small updates $\pi \to \pi'$, by gradient ascent or similar. As such, we simply replace $\rho_{\pi'}$ by $\rho_\pi$, defining the surrogate
$$L_\pi(\pi') := R(\pi) + \mathbb{E}_{S \sim \rho_\pi,\, A \sim \pi'(\cdot \mid S)}\big[A^\pi(S, A)\big].$$
Indeed, the replacement $\rho_{\pi_{\theta'}} \to \rho_{\pi_\theta}$ does not change the leading order as $\theta' \to \theta$, since $\pi_{\theta'}$ already appears in the 'derivative' (difference) part. So, a sufficiently small step $\pi_\theta \to \pi_{\theta'}$ that improves the surrogate $L_{\pi_\theta}$ also improves the original $R$. But, it does not give any guidance on how large a step can be taken.
Constrained Optimisation
One potential avenue is to use a conservative mixture: let $\pi^\star := \operatorname{argmax}_{\tilde\pi} L_\pi(\tilde\pi)$ and define $\pi' := \alpha \pi^\star + (1 - \alpha)\pi$. Then, Kakade & Langford (2002) derived the lower bound
$$R(\pi') \ge L_\pi(\pi') - \frac{2 \gamma\, \epsilon^\star}{(1 - \gamma)^2}\, \alpha^2, \quad\text{where } \epsilon^\star := \max_s \big|\mathbb{E}_{A \sim \pi^\star(\cdot \mid s)}[A^\pi(s, A)]\big|.$$
This lower bound on $R(\pi') - R(\pi)$ tends to be pessimistic in practice. An alternative approach, inspired by this, is to use a constrained optimisation problem:
$$\text{given } \pi, \quad \text{maximise } L_\pi(\pi') \text{ over policies } \pi' \quad \text{subject to } KL^\star(\pi \,\|\, \pi') \le \delta,$$
where $KL^\star(\pi \,\|\, \pi') := \max_s KL\big(\pi(\cdot \mid s) \,\|\, \pi'(\cdot \mid s)\big)$.
However, this imposes a KL-constraint at every state s, making it impractical. Instead, we replace this by an average KL-divergence:
$$KL(\pi \,\|\, \pi') := \mathbb{E}_{S \sim \rho_\pi}\big[KL\big(\pi(\cdot \mid S) \,\|\, \pi'(\cdot \mid S)\big)\big].$$
It turns out (see below) that this average is much easier to estimate. We use it in our constraint:
$$\text{given } \pi, \quad \text{maximise } L_\pi(\pi') \text{ over policies } \pi' \quad \text{subject to } KL(\pi \,\|\, \pi') \le \delta.$$
Note that the gradient of the surrogate at the current policy, $\nabla_{\theta'} L_{\pi_\theta}(\pi_{\theta'})\big|_{\theta' = \theta}$, is the same as in the usual policy gradient.
Sample-Based Estimation of the Objective and Constraint
Importance sampling lets us rewrite the surrogate as an expectation under the current policy:
$$L_\pi(\pi') - R(\pi) = \mathbb{E}_{S \sim \rho_\pi,\, A \sim \pi'(\cdot \mid S)}\big[A^\pi(S, A)\big] = \mathbb{E}_{S \sim \rho_\pi,\, A \sim \pi(\cdot \mid S)}\Big[\tfrac{\pi'(A \mid S)}{\pi(A \mid S)}\, A^\pi(S, A)\Big].$$
This also means that we need only evaluate $\pi'(\cdot \mid s)$ at a (random) location $A$, not sample from $\pi'(\cdot \mid s)$: we just sample $S \sim \rho_\pi$ and then $A \sim \pi(\cdot \mid S)$ given $S$. The expectation can then be estimated by simply averaging $(s, a) \mapsto A^{\pi, \pi'}(s, a) := \tfrac{\pi'(a \mid s)}{\pi(a \mid s)} A^\pi(s, a)$ over rollouts $(S^{(i)}_t, A^{(i)}_t)_{t \ge 0}$ ($i = 1, \ldots, M$, say) generated under the current policy $\pi$. This even enables evaluation of multiple candidate $\pi'$-s from the same collection of rollouts. When optimising over $\pi'$, we may ignore the additive $R(\pi)$ term.
We can estimate the KL term in the constraint via similar rollouts. A naive method samples many actions per (observed) state to estimate the KL divergence $KL(\pi(\cdot \mid s) \,\|\, \pi'(\cdot \mid s))$ directly. A smarter method expands the KL:
$$KL(\pi \,\|\, \pi') = \mathbb{E}_{S \sim \rho_\pi,\, A \sim \pi(\cdot \mid S)}\Big[\log \tfrac{\pi(A \mid S)}{\pi'(A \mid S)}\Big].$$
This can be estimated via rollouts under $\pi$ as before. In fact, we likely already collected the ratios $\pi'(A \mid S)/\pi(A \mid S)$ during the estimation of the objective.
Last, we could actually use a different distribution in the importance sampling. Using $A \sim \pi(\cdot \mid s)$ is perhaps the simplest option, and is referred to as single path in Section 5.1 of the original TRPO paper; a vine method, which requires being able to restore the system to particular previous states, is discussed in Section 5.2 there.
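As a concrete sketch (with illustrative array names), both the surrogate objective and the averaged KL can be estimated from the same logged rollouts under $\pi$, assuming the old log-probabilities and advantage estimates were stored:

```python
import numpy as np

def estimate_surrogate_and_kl(logp_old, logp_new, advantages):
    """Estimate L_pi(pi') - R(pi) and KL(pi || pi') from rollouts collected under pi.

    logp_old:   log pi(A_t | S_t) under the sampling policy pi      (shape (N,))
    logp_new:   log pi'(A_t | S_t) under the candidate policy pi'   (shape (N,))
    advantages: estimates of A^pi(S_t, A_t)                         (shape (N,))
    """
    ratio = np.exp(logp_new - logp_old)        # pi'(A|S) / pi(A|S)
    surrogate = np.mean(ratio * advantages)    # importance-sampled objective
    kl = np.mean(logp_old - logp_new)          # sample average of log(pi/pi') under pi
    return surrogate, kl
```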
Using Gradient Ascent and the Fisher Information
Alternatively, the KL can be approximated more 'directly' using the Fisher information matrix $F$: writing $\pi = \pi_\theta$ and $\pi' = \pi_{\theta'}$,
$$KL(\pi_\theta \,\|\, \pi_{\theta'}) \approx \tfrac12 (\theta' - \theta)^\top F(\theta)\, (\theta' - \theta), \quad\text{where } F(\theta) := \mathbb{E}\big[\nabla_\theta \log \pi_\theta(A \mid S)\, \nabla_\theta \log \pi_\theta(A \mid S)^\top\big].$$
The Fisher matrix is just an expectation, so can be estimated with (the same) samples. This leading-order approximation can be combined with a linearisation of the objective, $L_\pi(\pi_{\theta'}) \approx L_\pi(\pi_\theta) + g^\top (\theta' - \theta)$ with $g := \nabla_{\theta'} L_\pi(\pi_{\theta'})\big|_{\theta' = \theta}$, giving
$$\text{maximise } g^\top (\theta' - \theta) \quad\text{subject to } \tfrac12 (\theta' - \theta)^\top F(\theta)\, (\theta' - \theta) \le \delta,$$
whose solution is the natural-gradient step $\theta' = \theta + \sqrt{2\delta / (g^\top F(\theta)^{-1} g)}\, F(\theta)^{-1} g$.
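A minimal numerical sketch of this step, assuming the gradient $g$ and a (sample) estimate of the Fisher matrix $F$ have already been formed; in practice, $F^{-1} g$ is computed with conjugate gradients rather than a direct solve.

```python
import numpy as np

def natural_gradient_step(theta, g, F, delta=0.01):
    """Maximise g.(theta' - theta) subject to 0.5 (theta' - theta)^T F (theta' - theta) <= delta."""
    F_inv_g = np.linalg.solve(F, g)                    # F^{-1} g (direct solve, for the sketch)
    step_size = np.sqrt(2.0 * delta / (g @ F_inv_g))   # scale so the quadratic constraint is tight
    return theta + step_size * F_inv_g
```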
We may not have access to the advantage $A^\pi(s, a)$, and need to replace it with an estimator $\hat A^\pi(s, a)$. There are a couple of main options.
| Method | Bias | Variance | Sample Efficiency |
| --- | --- | --- | --- |
| MC | Low | High | Low |
| A3C | High | Low | High |
| GAE | Tunable | Tunable | Balanced |
The discounted return, with rewards indexed from the current timestep, is
$$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots.$$
Monte Carlo (MC) simply averages rollouts, using the full trajectory: $G^{\mathrm{MC}} = G$; this makes it slow. Also, the rewards may vary significantly in space, leading to a high-variance estimate.
Asynchronous Advantage Actor–Critic (A3C) goes in the other direction:
$$G^{\mathrm{A3C}(k)} = r_0 + \gamma r_1 + \cdots + \gamma^k r_k + \gamma^{k+1} V(s_{k+1}),$$
where sk+1 is the time-(k+1) state and V is the value function—which itself may be being learnt. An estimator of this has reduced variance, since it is only looking a few steps into the future.
Finally, Generalised Advantage Estimation (GAE) takes an exponential weighting of the $\mathrm{A3C}(k)$ returns:
$$G^{\mathrm{GAE}(\lambda)} = (1 - \lambda) \sum_{k \ge 0} \lambda^k\, G^{\mathrm{A3C}(k)}.$$
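In practice, GAE is usually computed with the equivalent backward recursion over temporal-difference errors $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$; a short sketch (array names are illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one trajectory.

    rewards: (T,) array of rewards r_0, ..., r_{T-1}
    values:  (T+1,) array of value estimates V(s_0), ..., V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error at time t
        running = delta + gamma * lam * running                  # exponentially weighted sum
        advantages[t] = running
    return advantages
```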
Instead of imposing a hard constraint, PPO (v1) penalises the KL in the objective: maximise $L_\pi(\pi') - \beta\, KL(\pi \,\|\, \pi')$ over $\pi'$. The method of Lagrange multipliers says that, given $\delta$, there is some $\beta$ so that the optimality points of the constrained and penalised problems agree (ie, same optimiser).
In practice, it can be difficult to choose the hyperparameter $\beta$. One adaptive option uses dual-descent-style updates for $\beta$, to try to match the original constraint level $\delta$. Iterate:
while unsatisfied: find the optimal $\pi'_\beta$ for the current $\beta$, and abbreviate $d := KL(\pi \,\|\, \pi'_\beta)$
If $d$ is too large (say, $d > \tfrac32 \delta$), then increase the penalty $\beta$ (say, $\beta \leftarrow 2\beta$)
If $d$ is too small (say, $d < \tfrac23 \delta$), then decrease the penalty $\beta$ (say, $\beta \leftarrow \beta/2$)
Otherwise, the KL is 'good enough' (say, $\tfrac23 \delta \le d \le \tfrac32 \delta$), so become satisfied
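A sketch of this adaptive-$\beta$ rule, applied once per iteration after measuring the realised KL $d$ (the 3/2 and 2/3 thresholds and the doubling/halving factors are the illustrative choices above):

```python
def adapt_beta(beta, d, delta):
    """Adaptive KL penalty: nudge beta so the realised KL d stays near the target delta."""
    if d > 1.5 * delta:       # KL too large: strengthen the penalty
        return 2.0 * beta
    if d < delta / 1.5:       # KL too small: weaken the penalty
        return beta / 2.0
    return beta               # within [2/3 delta, 3/2 delta]: good enough
```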
PPO (v2) instead replaces the KL term with clipping: writing $r_\pi(\pi') := \pi'(A \mid S)/\pi(A \mid S)$ for the probability ratio, the per-sample objective is
$$\min\big\{ r_\pi(\pi')\, \hat A,\; \operatorname{clip}\big(r_\pi(\pi'),\, 1 - \varepsilon,\, 1 + \varepsilon\big)\, \hat A \big\}.$$
The clipping stops the probability ratio $r_\pi(\pi')$ from changing too much, and the objective is the pessimistic minimum of this clipped term and the previous, unclipped version. The clipping has no effect to first order around the current policy—ie, if $\pi' \approx \pi$. It is really there to stop overly large updates, which were observed without the clipped term in the minimum.
Thinking about positive advantage for the moment, we want to increase the probability ratio. Once the clipping level $1 + \varepsilon$ is reached, increasing the ratio further cannot increase the objective, so there is nothing to be gained beyond this point. An analogous statement holds for pushing the ratio down when the advantage is negative. This restricts the influence of each state–action pair.
This tends to work a bit better than TRPO on continuous control (eg, robotics) and much better on discrete problems (eg, Atari).
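A minimal sketch of the clipped surrogate loss in PyTorch (tensor names are illustrative; in a full implementation this sits inside a training loop with minibatching and a value-function loss):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate: minimise this to maximise the pessimistic objective."""
    ratio = torch.exp(logp_new - logp_old)                       # r = pi'(a|s) / pi(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # pessimistic minimum, averaged
```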
GRPO: Group Relative Policy Optimisation
General Method
The "critic" model, required to estimate the value of actions, in RL algorithms like PPO requires training. This training can be resource intensive. GRPO removes the need for a critic model, instead sampling a "group" of actions and comparing their relative reward. This does not take into account future rewards (ie, the value function at the new state), which the critic in PPO does. Further, it is an off-policy method: there is a fixed distribution μ over states (or questions) from which the states are drawn.
More precisely, instead of drawing a single action $A \sim \pi(\cdot \mid s)$, many (iid) actions $A_1, \ldots, A_G \sim_{\mathrm{iid}} \pi(\cdot \mid s)$ are drawn. The $i$-th advantage $\hat A_i$ (ie, of action $A_i$) is calculated relative to $(A_1, \ldots, A_G)$. For example, one option is to take
$$\hat A_i = \big(r_i - \operatorname{mean}(r)\big) / \operatorname{std}(r),$$
where $r_i$ is the (random) reward received after taking action $A_i$; naturally,
$$\operatorname{mean}(r) = \frac1G \sum_{i=1}^G r_i \quad\text{and}\quad \operatorname{std}(r) = \sqrt{\frac1G \sum_{i=1}^G \big(r_i - \operatorname{mean}(r)\big)^2}.$$
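In code, the group-relative advantage is just a per-group standardisation of the rewards (the small epsilon is an assumed guard against a zero standard deviation):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise a group of rewards: A_i = (r_i - mean(r)) / std(r)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```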
Additionally, a KL penalisation term is included in the objective, between the new policy $\pi'$ and a reference policy $\pi_{\mathrm{ref}}$. This is not the old policy $\pi$; DeepSeek suggest it is usually the initial supervised fine-tuned model—ie, the one before GRPO RL is applied.
Altogether, the final surrogate objective, to be maximised, is
$$\mathbb{E}_{S \sim \mu,\; A_1, \ldots, A_G \sim_{\mathrm{iid}} \pi(\cdot \mid S)}\bigg[\frac1G \sum_{i=1}^G \Big( \operatorname{min\text{–}clip}\big(r_i, \hat A_i\big) - \beta\, KL\big(\pi'(\cdot \mid S) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid S)\big) \Big)\bigg],$$
where
$$r_i := \pi'(A_i \mid S) / \pi(A_i \mid S)$$
and $\hat A_i$ is the (relative) advantage, as before; here, $\operatorname{min\text{–}clip}(r, A) := \min\big\{ r A,\; \operatorname{clip}(r,\, 1 - \varepsilon,\, 1 + \varepsilon)\, A \big\}$ and $\beta$ is the KL-penalty weight. Importantly, the whole sum must be inside the expectation, since the (relative) advantage is calculated based on $(A_1, \ldots, A_G)$. Notice that $S \sim \mu$, not a measure which depends on the current policy $\pi$ (or the proposed $\pi'$).
We (unnecessarily) left the KL inside the expectation, and even the summation. This is because, rather than calculate the KL exactly, the following unbiased estimator is typically used:
$$KL\big(\pi'(\cdot \mid S) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid S)\big) \approx \frac{\pi_{\mathrm{ref}}(A_i \mid S)}{\pi'(A_i \mid S)} - \log \frac{\pi_{\mathrm{ref}}(A_i \mid S)}{\pi'(A_i \mid S)} - 1.$$
The clipping prevents the new policy π′ from being too far from the old policy π, and this final KL penalty prevents any policy used from being too far from the original (reference) policy πref.
Notably, GRPO does not consider any future rewards—it would need a critic (or some other addition) to do that. This makes it suitable for Question–Answer models, which is the context in which it was introduced. Such scenarios are special because there are no 'onward states' beyond the answer, and hence no future rewards.
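Putting the pieces together, here is a minimal sketch of the GRPO surrogate for a single question and a group of $G$ sampled answers; the pre-computed log-probabilities, the $\beta$ value and the epsilon guard are assumptions for the example.

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """GRPO surrogate (to be maximised) for one question and G sampled answers.

    logp_new, logp_old, logp_ref: (G,) log-probabilities of the sampled answers under
    the candidate policy pi', the sampling policy pi and the reference policy pi_ref.
    rewards: (G,) rewards of the sampled answers.
    """
    # Group-relative advantages: standardised rewards.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-weighted advantages, as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)

    # Per-sample KL estimator: exp(x) - x - 1 with x = log(pi_ref / pi').
    x = logp_ref - logp_new
    kl_est = torch.exp(x) - x - 1.0

    return (clipped - beta * kl_est).mean()
```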
Set-Up for Question–Answer Models
Question–Answer models can, and frequently do, have reasoning traces (interpreted broadly). In particular, given a question $q$, an action $A$ may generate an output $O = (o_1, \ldots, o_m)$, where $m$ is the (enforced) length of the reasoning. In this case, an individual reward can (but needn't) be applied to each part of the trace. This leads to a summation of the form
$$\frac1G \sum_{i=1}^G \frac{1}{|O_i|} \sum_{t=1}^{m} \operatorname{min\text{–}clip}\big(r_{i,t}, \hat A_{i,t}\big),$$
where
$$r_{i,t} = \pi'(o_{i,t} \mid q, o_{i,1:t-1}) / \pi(o_{i,t} \mid q, o_{i,1:t-1})$$
and $\hat A_{i,t}$ is the advantage of the $t$-th token in the $i$-th trace. A canonical choice is to reward every token based on the final result (there is no need for the reward to be 'Markovian'):
$$\hat A_{i,t} = \hat A_i.$$
But, it is possible to use more refined strategies.
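For the token-level sum above, a sketch that broadcasts each trace's relative advantage to all of its tokens (shapes and names are illustrative):

```python
import torch

def grpo_token_level_surrogate(logp_new, logp_old, trace_advantages, eps=0.2):
    """Token-level clipped surrogate for G traces of m tokens each.

    logp_new, logp_old: (G, m) per-token log-probabilities under pi' and pi.
    trace_advantages:   (G,) group-relative advantages, one per trace.
    """
    adv = trace_advantages.unsqueeze(1)                          # A_{i,t} = A_i for every token t
    ratio = torch.exp(logp_new - logp_old)                       # r_{i,t}
    clipped = torch.min(ratio * adv, torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)
    return clipped.mean(dim=1).mean()                            # (1/G) sum_i (1/|O_i|) sum_t
```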
Comparison: TRPO/PPO vs GRPO
Training Differences: TRPO/PPO vs GRPO
One important distinction between the TRPO/PPO and GRPO paradigms is the manner in which the (surrogate) objective is estimated.
TRPO/PPO is an on-policy method.
It samples a long trajectory $(S_t, A_t)_{t \ge 0}$ in a given step, in order to get a good estimate of the expectation, which involves $S \sim \rho_\pi$ and hence takes the discounting into account.
If the environment does not permit long samples (>1k steps), such as an Atari game, then many trajectories are sampled; all start from a fixed distribution $p_0$ over initial states.
As the algorithm progresses, the distribution $\rho_\pi$ changes.
GRPO is an off-policy method.
The state (eg, question) is drawn from a fixed distribution $S \sim \mu$. Once the state $S$ has been sampled, a group of $G$ actions is sampled from the same state: $A_1, \ldots, A_G \sim_{\mathrm{iid}} \pi(\cdot \mid S)$.
These G state–action pairs (each with the same state) are averaged.
GRPO really is designed with question–answer models in mind. TRPO/PPO make less sense in this context, since there are no future rewards to discount (a bit like γ=0), and the trajectories are all single-step.
Potential Extension for GRPO
The lack of a critic model makes the process both faster to train and (often, more importantly) more memory efficient; the critic may be as large as the model being trained. However, it is my feeling that it is really the lack of future rewards which removes the need for a critic, rather than the use of relative rewards:
estimating value functions is difficult because of the need to behave optimally in the future;
if the future rewards are ignored (γ=0), this goes away.
The sample means and standard deviations are proxies for the true means and standard deviations under the current policy. These may be quite crude estimators when the number $G$ of samples is small—eg, $G = 4$. Potentially, a better estimate of these—eg, by using a larger number of samples—may improve performance.
However, increasing $G$ may hurt performance elsewhere: the (surrogate) objective is an average, so the signal may decrease, by some type of concentration/LLN. One option could be to increase $G$ and divide by $\sqrt{G}$ instead of $G$.