RL Algorithms Deep-Dive: TRPO, PPO & GRPO

High-Level Objective

The high-level objective for policy optimisation is to maximise the expected reward over policies:

$$\textsf{maximise}\quad \mathbb E_\pi\Bigl[ \textstyle\sum_{t=0}^H R(s_t) \Bigr] \quad\textsf{over policies}\quad \pi;$$

here, $(s_t, a_t)$ is the time-$t$ state–action pair, $R$ is the reward function and $H$ is some time horizon. Typically, the policies are parametrised by a vector $\theta$ and are stochastic: $\pi_\theta(a \mid s)$ is the probability of taking action $a$ when in state $s$.

This summary is focused on policy optimisation, and has four parts.

  1. Vanilla policy gradient
  2. Introduce trust regions: TRPO
  3. Extend to proximal policy optimisation (v1 & v2)
  4. Introduce group relative estimation: GRPO

Vanilla Policy Gradient

The 'vanilla' policy gradient algorithm uses a baseline $b$ to estimate the advantage. This could be a value function, which may itself be being learned.

See Lecture 3: Policy Gradients and Advantage Estimation in Pieter Abbeel's Foundations of Deep RL series for an overview or Lecture 6: Policy Gradients in Sergey Levine's Deep RL course for a detailed account.
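As a concrete illustration, here is a minimal NumPy sketch of the resulting gradient estimate for a linear–softmax policy; the helper names and the trajectory format are illustrative assumptions, not taken from the referenced lectures.

```python
import numpy as np

# Minimal sketch: REINFORCE-style gradient estimate with a baseline,
# for a linear-softmax policy over discrete actions.

def softmax_policy(theta, s):
    """Action probabilities pi_theta(. | s); theta has shape (n_actions, n_features)."""
    logits = theta @ s
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """Score function grad_theta log pi_theta(a | s)."""
    p = softmax_policy(theta, s)
    g = -np.outer(p, s)                     # -pi(a' | s) * s for every row a'
    g[a] += s                               # +s for the action actually taken
    return g

def vanilla_pg_estimate(theta, states, actions, rewards, baseline, gamma=0.99):
    """Average of grad log pi(a|s) * (reward-to-go - baseline(s)) over one trajectory."""
    returns, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # discounted reward-to-go
        returns[t] = running
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        grad += grad_log_pi(theta, s, a) * (G - baseline(s))
    return grad / len(rewards)
```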

TRPO: Trust Region Policy Optimisation

TRPO was introduced by Schulman et al. in a 2015 paper; see also Schulman's thesis. The 'trust region' constrains how far away the new policy can be from the current one. In doing so, it enables some local approximations which make the optimisation easier.

Again, an overview is given in Pieter Abbeel's Lecture 4: TRPO and PPO and a detailed account in Sergey Levine's Lecture 9: Advanced Policy Gradients.

Preliminaries

Before proceeding formally, we set up some notation. For simplicity, we assume an infinite time horizon: $H = \infty$. We assume that the initial state $S_0 \sim p_0$ and that there is some transition kernel $p$ such that $S_{t+1} \sim p(\cdot \mid s, a)$ given $(S_t, A_t) = (s, a)$, independent of the policy.

Given a policy $\pi$, denote the state–action value function

$$Q_\pi(s, a) = \mathbb E_{(S_t, A_t)_{t\ge0} \sim \pi}\Bigl[ \textstyle\sum_{t\ge0} \gamma^t r(S_t, A_t, S_{t+1}) \Bigm| (S_0, A_0) = (s, a) \Bigr],$$

the expected discounted (future) reward given the current state is $s$ and action $a$ is taken. Then, the value function

$$V_\pi(s) = \mathbb E_{A \sim \pi(\cdot \mid s)}[ Q_\pi(s, A) ]$$

is the average value wrt $\pi$. The advantage wrt $\pi$ quantifies the benefit of choosing action $a$ over following $\pi$ on average:

$$\begin{aligned} \mathcal A_\pi(s, a) &= Q_\pi(s, a) - V_\pi(s) \\ &= Q_\pi(s, a) - \mathbb E_{A \sim \pi(\cdot \mid s)}[ Q_\pi(s, A) ] = \mathbb E_{A \sim \pi(\cdot \mid s)}[ Q_\pi(s, a) - Q_\pi(s, A) ]; \end{aligned}$$

and the expected discounted reward is

$$R(\pi) = \mathbb E_{S \sim p_0}[ V_\pi(S) ] = \mathbb E_{S \sim p_0,\, A \sim \pi(\cdot \mid S)}[ Q_\pi(S, A) ] = \mathbb E_{(S_t, A_t)_{t\ge0} \sim \pi}\Bigl[ \textstyle\sum_{t\ge0} \gamma^t r(S_t, A_t, S_{t+1}) \Bigr].$$

First-Order Simplification

We can use the 'nested' nature of the sums to obtain an expression for the change in expected reward when changing from policy $\pi$ to $\pi'$:

$$R(\pi') - R(\pi) = \mathbb E_{(S'_t, A'_t)_{t\ge0} \sim \pi'}\Bigl[ \textstyle\sum_{t\ge0} \gamma^t \mathcal A_\pi(S'_t, A'_t) \Bigr].$$

Indeed, the state–action value function satisfies

$$Q_\pi(s, a) = \mathbb E_{S \sim p(\cdot \mid s, a)}[ r(s, a, S) + \gamma V_\pi(S) ],$$

where $p(\cdot \mid s, a)$ denotes the one-step transition probabilities from state $s$ when choosing action $a$. In particular, if $(S'_t, A'_t)_{t\ge0} \sim \pi'$, then $S'_{t+1} \sim p(\cdot \mid S'_t, A'_t)$ given $(S'_t, A'_t)$. Hence,

$$\begin{aligned} \mathbb E_{\pi'}\Bigl[ \textstyle\sum_{t\ge0} \gamma^t \mathcal A_\pi(S'_t, A'_t) \Bigr] &= \mathbb E_{\pi'}\Bigl[ \textstyle\sum_{t\ge0} \gamma^t \bigl( r(S'_t, A'_t, S'_{t+1}) + \gamma V_\pi(S'_{t+1}) - V_\pi(S'_t) \bigr) \Bigr] \\ &= \mathbb E_{\pi'}\Bigl[ \textstyle\sum_{t\ge0} \gamma^t r(S'_t, A'_t, S'_{t+1}) - V_\pi(S'_0) \Bigr] = R(\pi') - R(\pi), \end{aligned}$$

by the telescoping nature of the sum, recalling that $S_0, S'_0 \sim p_0$—ie, for both $\pi$ and $\pi'$.

We can rewrite this equation as a sum over states, rather than timesteps. To this end, for a policy $\pi$, let

$$\rho_\pi(s) = \textstyle\sum_{t\ge0} \gamma^t\, \mathbb P_\pi(S_t = s),$$

the discounted occupation measure under $\pi$; to emphasise, $S \sim \rho_\pi$ means $\mathbb P(S = s) \propto \rho_\pi(s)$. Note that

$$\textstyle\sum_s \rho_\pi(s) = \sum_{t\ge0} \gamma^t = 1/(1 - \gamma).$$

Then, we can write the increment

$$\begin{aligned} R(\pi') - R(\pi) &= \textstyle\sum_{t\ge0} \sum_{s'} \mathbb P_{\pi'}(S'_t = s') \sum_{a'} \pi'(a' \mid s')\, \mathcal A_\pi(s', a') \\ &= \textstyle\sum_{s'} \rho_{\pi'}(s') \sum_{a'} \pi'(a' \mid s')\, \mathcal A_\pi(s', a') \\ &= \textstyle\sum_{s'} \rho_{\pi'}(s')\, \mathbb E_{A' \sim \pi'(\cdot \mid s')}[ \mathcal A_\pi(s', A') ] \\ &= \tfrac1{1-\gamma}\, \mathbb E_{S' \sim \rho_{\pi'},\, A' \sim \pi'(\cdot \mid S')}[ \mathcal A_\pi(S', A') ]; \end{aligned}$$

in other words, sample a state $S' \sim \rho_{\pi'}$ according to the $\pi'$-occupation measure $\rho_{\pi'}$ and an action $A' \sim \pi'(\cdot \mid S')$ for this state according to $\pi'$, then take the expectation.

The sum $\sum_{a'} \pi'(a' \mid s')\, \mathcal A_\pi(s', a')$ is a relatively simple function of $\pi'$. Eg, if $\pi' = (1 - \varepsilon) \pi + \varepsilon \nu$, then

$$\textstyle\sum_{a'} \pi'(a' \mid s')\, \mathcal A_\pi(s', a') = \varepsilon \sum_{a'} \nu(a' \mid s')\, \mathcal A_\pi(s', a'),$$

since $\mathbb E_{A \sim \nu(\cdot \mid s)}[\mathcal A_\nu(s, A)] = 0$ for any $\nu$—"the advantage of a measure over itself is $0$"—applied here with $\nu = \pi$. However, the dependence of $\rho_{\pi'}(s')$ on $\pi'$ is far more complex.

We plan to make small updates $\pi \to \pi'$, by gradient ascent or similar. As such, we simply replace $\rho_{\pi'}$ by $\rho_\pi$:

$$\widetilde R_\pi(\pi') = R(\pi) + \tfrac1{1-\gamma} \Delta_\pi(\pi'),$$

where

$$\Delta_\pi(\pi') := (1 - \gamma) \textstyle\sum_s \rho_\pi(s)\, \mathbb E_{A' \sim \pi'(\cdot \mid s)}[ \mathcal A_\pi(s, A') ] = \mathbb E_{S \sim \rho_\pi,\, A' \sim \pi'(\cdot \mid S)}[ \mathcal A_\pi(S, A') ].$$

We now parametrise the policy $\pi$ by a vector $\theta$ and overload notation; eg, $\widetilde R_\theta(\theta') = \widetilde R_{\pi_\theta}(\pi_{\theta'})$ and $R(\theta) = R(\pi_\theta)$. Then,

$$\widetilde R_{\theta_0}(\theta_0) = R(\theta_0) \quad\textsf{and}\quad \nabla_\theta \widetilde R_{\theta_0}(\theta) \big|_{\theta = \theta_0} = \nabla_\theta R(\theta) \big|_{\theta = \theta_0}.$$

Indeed, replacing $\rho_{\pi_\theta}$ by $\rho_{\pi_{\theta'}}$ does not change the leading order as $\theta' \to \theta$, since the occupation measure already multiplies the 'derivative' (difference) part. So, a sufficiently small step $\pi_\theta \to \pi_{\theta'}$ that improves $\widetilde R_\theta = \widetilde R_{\pi_\theta}$ also improves the original $R$. But this does not give any guidance on how large a step can be taken.

Constrained Optimisation

One potential avenue is to use a conservative mixture: let $\pi_\star$ maximise $\widetilde R_\pi(\cdot)$ and define $\pi' := \alpha \pi_\star + (1 - \alpha) \pi$. Then, Kakade & Langford (2002) derived the lower bound

$$R(\pi') - \widetilde R_\pi(\pi') \ge - 2 \varepsilon \gamma (1 - \gamma)^{-2} \alpha^2 \quad\textsf{where}\quad \varepsilon = \mathop{\textsf{max}}_s \bigl| \mathbb E_{A' \sim \pi'(\cdot \mid s)}[\mathcal A_\pi(s, A')] \bigr|.$$

This policy class is unwieldy and restrictive in practice. In fact, the proof can be extended to show, for any policy $\pi'$, that

$$R(\pi') - \widetilde R_\pi(\pi') \ge - 4 \varepsilon' \gamma (1 - \gamma)^{-2}\, \mathsf{KL}^\star(\pi \mathrel{\|} \pi') \quad\textsf{where}\quad \varepsilon' = \mathop{\textsf{max}}_{s, a} |\mathcal A_\pi(s, a)| \quad\textsf{and}\quad \mathsf{KL}^\star(\pi \mathrel{\|} \pi') = \mathop{\textsf{max}}_s \mathsf{KL}\bigl( \pi(\cdot \mid s) \bigm\| \pi'(\cdot \mid s) \bigr).$$

This lower bound on $R(\pi') - \widetilde R_\pi(\pi')$ tends to be pessimistic in practice. An alternative approach, inspired by this, is to use a constrained optimisation problem:

$$\textsf{given $\pi$}, \quad \textsf{maximise} \quad \widetilde R_\pi(\pi') \quad \textsf{over policies $\pi'$} \quad \textsf{subject to} \quad \mathsf{KL}^\star(\pi \mathrel{\|} \pi') \le \delta.$$

However, this imposes a KL-constraint at every state $s$, making it impractical. Instead, we replace this by an average KL-divergence:

$$\overline{\mathsf{KL}}(\pi \mathrel{\|} \pi') = \mathbb E_{S \sim \rho_\pi}\bigl[ \mathsf{KL}\bigl( \pi(\cdot \mid S) \bigm\| \pi'(\cdot \mid S) \bigr) \bigr].$$

It turns out (see below) that this average is much easier to estimate. We use it in our constraint:

$$\textsf{given $\pi$}, \quad \textsf{maximise} \quad \widetilde R_\pi(\pi') \quad \textsf{over policies $\pi'$} \quad \textsf{subject to} \quad \overline{\mathsf{KL}}(\pi \mathrel{\|} \pi') \le \delta.$$

Such a constrained optimisation problem can be (approximately) solved efficiently using the conjugate gradient method; see Appendix C, 'Efficiently Solving the Trust-Region Constrained Optimization Problem', of the original TRPO paper, or the 'Solving KL Penalized Problem' chapter of John Schulman's lecture on policy gradients, TRPO and PPO, for more details.
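For a flavour of that step, here is a minimal NumPy sketch of conjugate gradient; it assumes access to a function `fvp(v)` returning the Fisher–vector product $Fv$ and an estimated policy gradient `g`, both of which would come from the sampling described below.

```python
import numpy as np

# Sketch: conjugate gradient solves F x = g approximately, using only
# Fisher-vector products fvp(v) = F @ v, without ever forming F.

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    x = np.zeros_like(g)
    r = g.copy()                        # residual g - F x (x = 0 initially)
    p = g.copy()                        # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x                            # approximately F^{-1} g

# The TRPO update then rescales x so the quadratic constraint is tight:
# step = sqrt(2 * delta / (x @ fvp(x))) * x.
```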

Importance Sampling

Using importance sampling, we can replace the law $\pi'$ by $\pi$:

$$\mathbb E_{A' \sim \pi'(\cdot \mid s)}[ \mathcal A_\pi(s, A') ] = \mathbb E_{A \sim \pi (\cdot \mid s)}\bigl[ \tfrac{\pi'(A \mid s)}{\pi(A \mid s)} \mathcal A_\pi(s, A) \bigr].$$

Let's parametrise the policies, overloading notation:

$$\Delta_\theta(\theta' \mid s) := \Delta_{\pi_\theta}(\pi_{\theta'} \mid s) := \mathbb E_{A \sim \pi_\theta(\cdot \mid s)}\bigl[ \tfrac{\pi_{\theta'}(A \mid s)}{\pi_\theta(A \mid s)} \mathcal A_{\pi_\theta}(s, A) \bigr];$$

then, by the chain rule,

$$\nabla_{\theta'} \Delta_\theta(\theta' \mid s) \big|_{\theta' = \theta} = \mathbb E_{A \sim \pi_\theta(\cdot \mid s)}\bigl[ \nabla_\theta \log \pi_\theta(A \mid s)\, \mathcal A_{\pi_\theta}(s, A) \bigr].$$

So, the gradient at the current policy is the same as in the usual policy gradient.

Sample-Based Estimation of the Objective and Constraint

The importance sampling also means that we need only evaluate $\pi'(\cdot \mid s)$ at a (random) location $A$, not sample from $\pi'(\cdot \mid s)$. Instead, we just need to sample $S \sim \rho_\pi$ and then $A \sim \pi(\cdot \mid S)$ given $S$. This enables us to estimate the surrogate objective

$$\widetilde R_\pi(\pi') = R(\pi) + \tfrac1{1-\gamma}\, \mathbb E_{S \sim \rho_\pi,\, A \sim \pi(\cdot \mid S)}\bigl[ \tfrac{\pi'(A \mid S)}{\pi(A \mid S)} \mathcal A_\pi(S, A) \bigr],$$

by simply averaging $(s, a) \mapsto \widetilde{\mathcal A}_{\pi, \pi'}(s, a) = \tfrac{\pi'(a \mid s)}{\pi(a \mid s)} \mathcal A_\pi(s, a)$ over rollouts $(S^{(i)}_t, A^{(i)}_t)_{t\ge0} \sim^\mathsf{iid} \pi$ ($i = 1, ..., M$, say) under the current policy $\pi$. This even enables evaluation of multiple candidate $\pi'$-s from the same collection of rollouts. When optimising over $\pi'$, we may ignore the additive $R(\pi)$ term.
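A minimal sketch of this estimator, assuming we have logged per-sample log-probabilities under $\pi$ and the candidate $\pi'$ together with advantage estimates (the variable names are illustrative):

```python
import numpy as np

# Sketch: importance-sampled estimate of the surrogate objective
# (up to the additive constant R(pi), which is ignored when optimising).

def surrogate_objective(logp_new, logp_old, adv):
    ratio = np.exp(logp_new - logp_old)   # pi'(a|s) / pi(a|s) per sample
    return float(np.mean(ratio * adv))
```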

We can estimate the KL term in the constraint via similar rollouts. A naive method samples many actions per (observed) state to estimate the KL divergence $\mathsf{KL}\bigl( \pi(\cdot \mid s) \bigm\| \pi'(\cdot \mid s) \bigr)$ directly. A smarter method expands the KL:

$$\overline{\mathsf{KL}}(\pi \mathrel{\|} \pi') = \mathbb E_{S \sim \rho_\pi}\Bigl[ \textstyle\sum_a \pi(a \mid S) \log \tfrac{\pi(a \mid S)}{\pi'(a \mid S)} \Bigr] = \mathbb E_{S \sim \rho_\pi,\, A \sim \pi(\cdot \mid S)}\Bigl[ \log \tfrac{\pi(A \mid S)}{\pi'(A \mid S)} \Bigr].$$

This can be estimated via rollouts under $\pi$ as before. In fact, we likely even collected the ratios $\frac{\pi'(A \mid S)}{\pi(A \mid S)}$ during the estimation of the objective.
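Continuing the sketch above (same assumed inputs), the average KL is then just the mean of the negated log-ratios:

```python
import numpy as np

# Sketch: Monte Carlo estimate of KL-bar(pi || pi') from samples A ~ pi(. | S),
# reusing the log-probabilities already collected for the objective.

def average_kl(logp_new, logp_old):
    return float(np.mean(logp_old - logp_new))   # E_pi[ log(pi / pi') ]
```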

Last, we could actually use a different distribution in the importance sampling. Using $A \sim \pi(\cdot \mid s)$ is perhaps the simplest option, and is referred to as single path in Section 5.1 of the original TRPO paper; a vine method, which requires being able to restore the system to particular previous states, is discussed in Section 5.2 there.

Using Gradient Ascent and the Fisher Information

Alternatively, the KL can be approximated more 'directly' using the Fisher information matrix $F$:

$$\overline{\mathsf{KL}}(\pi_{\theta'} \mathrel{\|} \pi_\theta) \approx \tfrac12 (\theta' - \theta)^\top F (\theta' - \theta) \quad\textsf{where}\quad F = \mathbb E_{\pi_\theta}\bigl[ \nabla_\theta \log \pi_\theta(A \mid S)\, \nabla_\theta \log \pi_\theta(A \mid S)^\top \bigr].$$

The Fisher matrix is just an expectation, so can be estimated with (the same) samples. This leading-order approximation can be combined with a linearisation of the objective:

$$\textsf{replace} \quad \Delta_\theta(\theta') = \mathbb E_{S \sim \rho_{\pi_\theta},\, A \sim \pi_\theta(\cdot \mid S)}\bigl[ \tfrac{\pi_{\theta'}(A \mid S)}{\pi_\theta(A \mid S)} \mathcal A_{\pi_\theta}(S, A) \bigr] \quad\textsf{with}\quad \bigl( \nabla_{\theta'} \Delta_\theta(\theta') \big|_{\theta' = \theta} \bigr)^\top (\theta' - \theta),$$

and recall that $\nabla_{\theta'} \Delta_\theta(\theta') \big|_{\theta' = \theta} = (1 - \gamma) \nabla_\theta R(\theta)$; the positive constant does not affect the constrained maximiser. The resulting constrained gradient ascent problem, namely

$$\textsf{maximise} \quad \bigl( \nabla_{\theta'} \Delta_\theta(\theta') \big|_{\theta' = \theta} \bigr)^\top (\theta' - \theta) \quad \textsf{over parameters $\theta'$} \quad \textsf{subject to} \quad \tfrac12 (\theta' - \theta)^\top F (\theta' - \theta) \le \delta,$$

can then be solved exactly:

$$\theta' = \theta + \alpha F^{-1} \nabla_\theta R(\theta) \quad\textsf{where}\quad \alpha = \sqrt{2 \delta / \nabla_\theta R(\theta)^\top F^{-1} \nabla_\theta R(\theta) }.$$

This is detailed in Sergey Levine's Lecture 9, Part 4.
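A minimal NumPy sketch of this closed-form step; it assumes `score` is an array of per-sample $\nabla_\theta \log \pi_\theta(A \mid S)$ vectors and `grad` an estimate of $\nabla_\theta R(\theta)$ (the damping term is a standard practical addition, not part of the formula above).

```python
import numpy as np

# Sketch: exact solution of the linearised trust-region step
# theta' = theta + alpha * F^{-1} grad, with alpha = sqrt(2 delta / (grad' F^{-1} grad)).

def natural_gradient_step(theta, grad, score, delta=0.01, damping=1e-3):
    F = score.T @ score / score.shape[0]        # empirical Fisher matrix
    F += damping * np.eye(theta.size)           # damping keeps F well conditioned
    nat_grad = np.linalg.solve(F, grad)         # F^{-1} grad
    alpha = np.sqrt(2 * delta / (grad @ nat_grad))
    return theta + alpha * nat_grad
```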

Estimated Advantage

We may not have access to the advantage $\mathcal A_\pi(s, a)$, and need to replace it with an estimator $\widehat{\mathcal A}_\pi(s, a)$. There are a few main options.

| Method | Bias | Variance | Sample Efficiency |
| --- | --- | --- | --- |
| MC | Low | High | Low |
| A3C | High | Low | High |
| GAE | Tunable | Tunable | Balanced |

The discounted return (reward-to-go) from the current time is

$$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots.$$

Monte Carlo (MC) simply averages rollouts, using the full trajectory: $G^\textsf{MC} = G$; this makes it slow. Also, the returns may vary significantly between rollouts, leading to a high-variance estimate.

Asynchronous Advantage Actor–Critic (A3C) goes in the other direction:

$$G^{\textsf{A3C}(k)} = r_0 + \gamma r_1 + \ldots + \gamma^k r_k + \gamma^{k+1} V(s_{k+1}),$$

where $s_{k+1}$ is the time-$(k+1)$ state and $V$ is the value function—which may itself be being learnt. An estimator of this has reduced variance, since it only looks a few steps into the future, at the cost of bias from the value estimate.

Finally, Generalised Advantage Estimation (GAE) takes an exponential weighting of the $\textsf{A3C}(k)$:

$$G^{\textsf{GAE}(\lambda)} = (1 - \lambda) \textstyle\sum_{k\ge0} \lambda^k G^{\textsf{A3C}(k)} = \ldots = \sum_{k\ge0} (\gamma \lambda)^k \bigl( r_k + \gamma (1-\lambda) V(s_{k+1}) \bigr).$$

Equivalently, $G^{\textsf{GAE}(\lambda)} = \mathbb E_{K \sim \mathop{\textsf{Geom}}_0(\lambda)}[ G^{\textsf{A3C}(K)} ]$. This interpolates between A3C(0) ($\lambda = 0$) and Monte Carlo ($\lambda = 1$).
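In practice, the GAE advantage $\widehat{\mathcal A}_t = G^{\textsf{GAE}(\lambda)}_t - V(s_t)$ is computed with the standard backward recursion over TD errors. A minimal sketch, assuming one trajectory of rewards and value estimates (including a bootstrap value for the final state):

```python
import numpy as np

# Sketch: GAE advantages via the backward recursion over TD errors
# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running       # exponential weighting in lambda
        adv[t] = running
    return adv   # adv[t] + values[t] recovers the lambda-return G^GAE(lambda) at time t
```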

PPO: Proximal Policy Optimisation

PPO moves the KL condition from the constraint into the objective, as a penalty: maximise $\widetilde R_\theta(\theta') - \beta\, \overline{\mathsf{KL}}(\pi_\theta \mathrel{\|} \pi_{\theta'})$ over $\theta'$, for some penalty coefficient $\beta > 0$.

The method of Lagrange multipliers actually says that, given $\delta$, there is some $\beta$ so that the optimality points agree (ie, same optimiser).

In practice, it can be difficult to choose the hyperparameter $\beta$. One adaptive option uses dual descent updates for $\beta$, to try to match it to the original constraint $\delta$. Iterate:

  1. maximise the penalised objective over $\theta'$, holding $\beta$ fixed;
  2. update $\beta \leftarrow \beta + \alpha_\beta \bigl( \overline{\mathsf{KL}}(\pi_\theta \mathrel{\|} \pi_{\theta'}) - \delta \bigr)$ for some step size $\alpha_\beta > 0$.

See Sergey Levine's Lecture 9, Part 2 for more details.
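A minimal sketch of that $\beta$-update (illustrative names; `mean_kl` would be the estimated average KL after the inner maximisation, and the floor is an assumed safeguard):

```python
# Sketch: dual-descent update for the KL-penalty coefficient beta.
# Increase the penalty when the KL overshoots the target delta, decrease otherwise.

def update_beta(beta, mean_kl, delta, beta_lr=0.1, beta_min=1e-8):
    return max(beta + beta_lr * (mean_kl - delta), beta_min)
```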

A popular extension adjusts the objective by clipping the probability ratio $r_\pi(\pi') := \pi'(a \mid s) / \pi(a \mid s)$, hiding the dependence on $(s, a)$:

$$\textsf{previously,} \quad \mathbb E_\pi[ r_\pi(\pi')\, \mathcal A_\pi ]; \qquad \textsf{now,} \quad \mathbb E_\pi\bigl[ \mathop{\textsf{min}}\{ r_\pi(\pi')\, \mathcal A_\pi,\ \mathop{\textsf{clip}}(r_\pi(\pi'), 1 - \varepsilon, 1 + \varepsilon)\, \mathcal A_\pi \} \bigr].$$

The clipping stops the probability ratio $r_\pi(\pi')$ from changing too much. The objective is the pessimistic minimum of the clipped and the previous, unclipped versions. The clipping has no effect to first order around the current policy—ie, if $\pi' \approx \pi$. It is really there to stop overly large updates, which were observed without the clipped term in the minimum.

Thinking about positive advantage for the moment, we want to increase the probability ratio. Once the clipping level $1 + \varepsilon$ is reached, increasing the ratio further cannot increase the objective, so there is nothing to be gained beyond this point. An analogous statement holds for pushing the ratio down when the advantage is negative. This restricts the influence of each state–action pair.
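A minimal sketch of the clipped surrogate, with the same assumed per-sample inputs as before:

```python
import numpy as np

# Sketch: PPO clipped surrogate objective.

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    ratio = np.exp(logp_new - logp_old)               # r = pi'(a|s) / pi(a|s)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))   # pessimistic minimum
```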

This tends to work a bit better than TRPO on continuous control (eg, robotics) and much better on discrete problems (eg, Atari).

GRPO: Group Relative Policy Optimisation

General Method

In RL algorithms like PPO, the "critic" model, required to estimate the value of actions, itself requires training. This training can be resource intensive. GRPO removes the need for a critic model, instead sampling a "group" of actions and comparing their relative rewards. It does not take into account future rewards (ie, the value function at the new state), which the critic in PPO does. Further, it is an off-policy method: there is a fixed distribution $\mu$ over states (or questions) from which the states are drawn.

More precisely, instead of drawing a single action $A \sim \pi(\cdot \mid s)$, many (iid) actions $A_1, ..., A_G \sim^\mathsf{iid} \pi(\cdot \mid s)$ are drawn. The $i$-th advantage $\widehat{\mathcal A}_i$ (ie, of $A_i$) is calculated relative to $(A_1, ..., A_G)$. For example, one option is to take

$$\widehat{\mathcal A}_i = (r_i - \mathop{\textsf{mean}} r) / \mathop{\textsf{std}} r,$$

where $r_i$ is the (random) reward received after taking action $A_i$; naturally,

$$\mathop{\textsf{mean}} r = \tfrac1G \textstyle\sum_{i=1}^G r_i \quad\textsf{and}\quad \mathop{\textsf{std}} r = \sqrt{ \tfrac1G \textstyle\sum_{i=1}^G (r_i - \mathop{\textsf{mean}} r)^2 }.$$
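A minimal sketch of this normalisation for one group of sampled answers (the small constant in the denominator is an assumed safeguard against zero variance, not part of the formula):

```python
import numpy as np

# Sketch: group-relative advantages for one question's G sampled answers.

def group_relative_advantages(rewards, eps=1e-8):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```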

Additionally, a KL penalisation term is included in the objective, between the new policy $\pi'$ and a reference policy $\pi_\textsf{ref}$. This is not the old policy $\pi$; DeepSeek suggest it is usually the initial supervised fine-tuned model—ie, the one before GRPO RL is applied.

Altogether, the final surrogate objective, to be maximised, is

$$L^\textsf{GRPO}_\pi(\pi') = \mathbb E_{S \sim \mu,\; A_1, ..., A_G \sim^\mathsf{iid} \pi(\cdot \mid S)}\Bigl[ \tfrac1G \textstyle\sum_{i=1}^G \bigl( \mathop{\textsf{min--clip}}(\rho_i, \widehat{\mathcal A}_i) - \beta\, \mathsf{KL}(\pi' \mathrel{\|} \pi_\textsf{ref}) \bigr) \Bigr]$$

where $\rho_i := \pi'(A_i \mid S) / \pi(A_i \mid S)$ is the probability ratio (written $\rho$ to avoid clashing with the reward $r_i$) and $\widehat{\mathcal A}_i$ is the (relative) advantage, as before; here, $\mathop{\textsf{min--clip}}(\rho, \mathcal A) := \min\{\rho \mathcal A,\, \mathop{\textsf{clip}}(\rho, 1 - \varepsilon, 1 + \varepsilon)\, \mathcal A\}$. Importantly, the whole sum must be inside the expectation, since the (relative) advantage is calculated based on $(A_1, ..., A_G)$. Notice how $S \sim \mu$, not a measure which depends on the current policy $\pi$ (or the proposed $\pi'$).

We (unnecessarily) left the KL inside the expectation, and even the summation. This is because, rather than calculate the KL exactly, the following unbiased estimator is typically used:

$$\widehat{\mathsf{KL}}_i := \rho^\textsf{ref}_i - 1 - \log \rho^\textsf{ref}_i \ge 0 \quad\textsf{where}\quad \rho^\textsf{ref}_i := \pi_\textsf{ref}(A_i \mid S) / \pi'(A_i \mid S).$$

The clipping prevents the new policy $\pi'$ from being too far from the old policy $\pi$, and this final KL penalty prevents any policy used from being too far from the original (reference) policy $\pi_\textsf{ref}$.
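Putting the pieces together, here is a minimal sketch of the per-group GRPO surrogate; it assumes per-answer log-probabilities under the new, old and reference policies, plus the group's rewards (names and default values are illustrative):

```python
import numpy as np

# Sketch: GRPO surrogate for one group of G answers to a single question.

def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative advantages
    rho = np.exp(logp_new - logp_old)                           # pi'(A_i|S) / pi(A_i|S)
    clipped = np.minimum(rho * adv, np.clip(rho, 1 - eps, 1 + eps) * adv)
    rho_ref = np.exp(logp_ref - logp_new)                       # pi_ref(A_i|S) / pi'(A_i|S)
    kl_hat = rho_ref - 1.0 - np.log(rho_ref)                    # unbiased, non-negative estimator
    return float(np.mean(clipped - beta * kl_hat))
```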

Notably, GRPO does not consider any future rewards—it would need a critic (or some other addition) to do that. This makes it suitable for question–answer models, which is the context in which it was introduced. Such scenarios are special because there are no 'onward states' beyond the answer, and hence no future rewards.

Set-Up for Question–Answer Models

Question–answer models can, and frequently do, have reasoning traces (interpreted broadly). In particular, given a question $q$, the $i$-th sampled answer is a sequence of tokens $o_{i,1}, ..., o_{i,|O_i|}$, where $|O_i|$ is the length of the $i$-th trace. In this case, an individual reward can (but needn't) be applied to each part of the trace. This leads to a summation of the form

$$\tfrac1G \textstyle\sum_{i=1}^G \tfrac1{|O_i|} \sum_{t=1}^{|O_i|} \mathop{\textsf{min--clip}}(\rho_{i,t}, \widehat{\mathcal A}_{i,t})$$

where $\rho_{i,t} = \pi'(o_{i,t} \mid q, o_{i, 1:t-1}) / \pi(o_{i,t} \mid q, o_{i, 1:t-1})$ and $\widehat{\mathcal A}_{i,t}$ is the advantage of the $t$-th token in the $i$-th trace. A canonical choice is to reward each token based on the final result (there is no need for the reward to be 'Markovian'): $\widehat{\mathcal A}_{i,t} = \widehat{\mathcal A}_i$ for all $t$. But it is possible to use more refined strategies.
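A minimal sketch of this token-level sum for one group, with the outcome-based advantage shared across every token of a trace (inputs are illustrative: per-trace arrays of token log-probabilities and one advantage per trace):

```python
import numpy as np

# Sketch: token-level GRPO sum for one group, sharing each trace's
# outcome-based advantage across all of its tokens.

def grpo_token_objective(logp_new, logp_old, advantages, eps=0.2):
    per_trace = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, advantages):
        rho = np.exp(np.asarray(lp_new) - np.asarray(lp_old))    # per-token ratios
        clipped = np.minimum(rho * a, np.clip(rho, 1 - eps, 1 + eps) * a)
        per_trace.append(clipped.mean())                         # (1/|O_i|) * sum over tokens
    return float(np.mean(per_trace))                             # (1/G) * sum over traces
```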

Comparison: TRPO/PPO vs GRPO

Training Differences: TRPO/PPO vs GRPO

One important distinction between the TRPO/PPO and GRPO paradigms is the manner in which the (surrogate) objective is estimated: TRPO/PPO estimate advantages from rollouts under the current policy, typically via a learned critic, whereas GRPO draws questions from a fixed distribution $\mu$, samples a group of answers per question, and uses only within-group relative rewards.

GRPO is really designed with question–answer models in mind. TRPO/PPO make less sense in this context, since there are no future rewards to discount (a bit like $\gamma = 0$), and the trajectories are all single-step.

Potential Extension for GRPO

The lack of a critic model makes the process both faster to train and (often, more importantly) more memory efficient; the critic may be as large as the model being trained. However, it is my feeling that it is really the lack of future rewards which removes the need for a critic, rather than the use of relative rewards:

The sample means and standard deviations are proxies for the true means and standard deviations under the current policy. These may be quite crude estimators when the number $G$ of samples is small—eg, $G = 4$. Potentially, a better estimate of these—eg, by using a larger number of samples—may improve performance.

However, increasing $G$ may hurt performance elsewhere: the (surrogate) objective is an average, so the signal may decrease, by some type of concentration/LLN. One option could be to increase $G$ and divide by $\sqrt G$ instead of $G$.