FlowRL Summary

FlowRL Summary

Methodology

High-Level Summary

Elevator Pitch

Mainstream reasoning LLMs are trained with reward-maximising RL methods. These tend to overfit to the dominant reward signal, neglecting all other reasoning paths—model collapse. Such other paths have a lower reward, but may still be valid—eg, simply longer or less elegant. FlowRL aims to minimise the (reverse) KL divergence between the LLM's policy πθ(yx)\pi_\theta(y \mid x) and a target distribution eβr(x,y)\propto e^{\beta r(x, y)}, where r(x,y)r(x, y) is the reward for answering yy to question xx.

The evaluation focusses on maths/code benchmarks and LLM-judged diversity of responses. Improvements in both are observed vs GRPO and PPO. However, it must be noted that there are many variants of GRPO explicitly designed to address diversity issues, such as DAPO, ProRL or GSPO/GMPO.

Distribution-matching vs reward-maximisation

Methodology

Reward Maximisation to Distribution Matching

The objective is to minimise a (reverse) KL divergence between the LLM's policy and a target distribution. Inspired by energy-based modelling, which comes from statistical physics, the target is a Gibbs measure given question xx:

eβr(x,y)/Z(x)e^{\beta r(x, y)} / Z(x), where r(x,y)r(x, y) is the reward, β\beta the inverse temperature and Z(x)Z(x) the partition function.

As always, the partition function is intractable; it is thus parametrised (Z=ZϕZ = Z_\phi) and learned. The objective is then to minimise

minθEXq[KL(πθ(X)eβr(x,y)/Zϕ(x))],()\min_\theta \mathbb E_{X \sim q}\bigl[ \mathsf{KL}\bigl( \pi_\theta(\cdot \mid X) \mathrel{\|} e^{\beta r(x, y)} / Z_\phi(x) \bigl) \bigr], \tag{$\star$}

where qq is a distribution over potential questions. (The expectation over xx is my assumption: it is not explained in the paper.)

This KL objective is approached via the framework of GFlowNets.

GFlowNets: brief summary.

Paraphrasing §2, GFlowNets are a probabilistic framework for training stochastic policies to sample discrete, compositional objects (eg, graphs or sequences) in proportion to a given reward. The core principle is to balance forward and backward probability flows at each state, inspired by flow matching (Bengio et al, 2021).

The technical observation (not knew to FlowRL) is that, in terms of expected gradients, minimising the KL objective in ()(\star) is equivalent to minimising the trajectory balance (TB) loss in GFlowNets. It is stated informally and non-rigorously in the FlowRL paper.

Proposition 1 (informal). In terms of expected gradients, minimising the KL objective in ()(\star) is equivalent to minimising the TB loss used in GFlowNet: minθKL(πθ(yx)eβr(x,y)/Zϕ(x))standard KL divergence    minθ(logZϕ(x)+logπθ(yx)βr(x,y))2trajectory balance in GFlowNets\min_\theta \underbrace{ \mathsf{KL}\bigl( \pi_\theta(y \mid x) \mathrel{\|} e^{\beta r(x, y)} / Z_\phi(x) \bigl) }_{\text{standard KL divergence}} \iff \min_\theta \underbrace{ \bigl( \log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta r(x, y) \bigr)^2 }_{\text{trajectory balance in GFlowNets}}

Formally, this should be interpreted in the following way.

Proposition 1 (formal). For a question xx and response yy, denote the pointwise KL and TB losses as follows: LθKL(x):=KL(πθ(x)eβr(x,)/Z(x));Lθ,ϕTB(x,y):=(logπθ(yx)+logZϕ(x)βr(x,y))2.\begin{aligned} \mathcal L^\mathsf{KL}_\theta(x) &:= \mathsf{KL}\bigl( \pi_\theta(\cdot \mid x) \mathrel{\|} e^{\beta r(x, \cdot)} / Z(x) \bigr); \\ \mathcal L^\mathsf{TB}_{\theta, \phi}(x, y) &:= \bigl( \log \pi_\theta(y \mid x) + \log Z_\phi(x) - \beta r(x, y) \bigr)^2. \end{aligned} Then, the "expected TB gradient" equals the (full) KL gradient: 2θLθKL(x)=EYπθ(x)[θLθ,ϕTB(x,y)].2 \nabla_\theta \mathcal L^\mathsf{KL}_\theta(x) = \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}[\nabla_\theta \mathcal L^\mathsf{TB}_{\theta, \phi}(x, y)]. In other words, to get an unbiased estimator of LθKL(x)\mathcal L^\mathsf{KL}_\theta(x), sample Yπθ(x)Y \sim \pi_\theta(\cdot \mid x) and compute θLθ,ϕTB(x,Y)\nabla_\theta \mathcal L^\mathsf{TB}_{\theta, \phi}(x, Y).

Derivation of formal version of Proposition 1.

Define the KL and trajectory balance (TB) objectives (losses), pointwise wrt questions xx: LθKL(x):=KL(πθ(x)eβr(x,)/Z(x));Lθ,ϕTB(x,y):=(logπθ(yx)+logZϕ(x)βr(x,y))2.\begin{aligned} \mathcal L^\mathsf{KL}_\theta(x) &:= \mathsf{KL}\bigl( \pi_\theta(\cdot \mid x) \mathrel{\|} e^{\beta r(x, \cdot)} / Z(x) \bigr); \\ \mathcal L^\mathsf{TB}_{\theta, \phi}(x, y) &:= \bigl( \log \pi_\theta(y \mid x) + \log Z_\phi(x) - \beta r(x, y) \bigr)^2. \end{aligned} Expanding the KL divergence, LθKL(x)=EYπθ(x)[log(πθ(Yx)Z(x)/eβr(x,Y))]=EYπθ(x)[logπθ(Yx)+logZ(x)βr(x,Y)].\begin{aligned} \mathcal L^\mathsf{KL}_\theta(x) &= \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}\bigl[ \log\bigl( \pi_\theta(Y \mid x) Z(x) / e^{\beta r(x, Y)} \bigr) \bigr] \\ &= \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}\bigl[ \log \pi_\theta(Y \mid x) + \log Z(x) - \beta r(x, Y) \bigr]. \end{aligned}

The law of Yπθ(x)Y \sim \pi_\theta(\cdot \mid x) depends on the parameters θ\theta. Abbreviate δθ(x,y):=logπθ(yx)+logZ(x)βr(x,y),\delta_\theta(x, y) := \log \pi_\theta(y \mid x) + \log Z(x) - \beta r(x, y), which has θδθ(x,y)=θlogπθ(yx).\nabla_\theta \delta_\theta(x, y) = \nabla_\theta \log \pi_\theta(y \mid x).

Start with the KL term, using πθ(yx)θlogπθ(yx)=θπθ(yx)\pi_\theta(y \mid x) \nabla_\theta \log \pi_\theta(y \mid x) = \nabla_\theta \pi_\theta(y \mid x) and yθπθ(yx)=θyπθ(yx)=θ1=0\sum_y \nabla_\theta \pi_\theta(y \mid x) = \nabla_\theta \sum_y \pi_\theta(y \mid x) = \nabla_\theta 1 = 0: θLθKL(x)=yθ(πθ(yx)δθ(x,y))=y(θπθ(yx)δθ(x,y)+πθ(yx)θlogπθ(yx))=yθπθ(yx)(δθ(x,y)+1)=yθπθ(yx)δθ(x,y)=EYπθ(x)[θlogπθ(Yx)δθ(x,Y)].\begin{aligned} \nabla_\theta \mathcal L^\mathsf{KL}_\theta(x) &\textstyle = \sum_y \nabla_\theta \bigl( \pi_\theta(y \mid x) \delta_\theta(x, y) \bigr) \\&\textstyle = \sum_y \bigl( \nabla_\theta \pi_\theta(y \mid x) \cdot \delta_\theta(x, y) + \pi_\theta(y \mid x) \cdot \nabla_\theta \log \pi_\theta(y \mid x) \bigr) \\&\textstyle = \sum_y \nabla_\theta \pi_\theta(y \mid x) \cdot (\delta_\theta(x, y) + 1) \\&\textstyle = \sum_y \nabla_\theta \pi_\theta(y \mid x) \cdot \delta_\theta(x, y) \\&\textstyle = \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}\bigl[ \nabla_\theta \log \pi_\theta(Y \mid x) \cdot \delta_\theta(x, Y) \bigr]. \end{aligned} This is the policy gradient form. Turning to the TB term, the derivative is taken before the expectation over yy: θLθ,ϕTB(x,y)=θδθ(x,y)2=2θlogπθ(yx)δθ(x,y).\nabla_\theta \mathcal L^\mathsf{TB}_{\theta, \phi}(x, y) = \nabla_\theta \delta_\theta(x, y)^2 = 2 \nabla_\theta \log \pi_\theta(y \mid x) \cdot \delta_\theta(x, y). Hence, 2θLθKL(x)=2θEYπθ(x)[δθ(x,Y)]=EYπθ(x)[2θlogπθ(Yx)δθ(x,Y)]=EYπθ(x)[θLθ,ϕTB(x,Y)].\begin{aligned} 2 \nabla_\theta \mathcal L^\mathsf{KL}_\theta(x) & = 2 \nabla_\theta \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}[ \delta_\theta(x, Y) ] \\& = \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}[ 2 \nabla_\theta \log \pi_\theta(Y \mid x) \cdot \delta_\theta(x, Y) ] = \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}[ \nabla_\theta \mathcal L^\mathsf{TB}_{\theta, \phi}(x, Y) ]. \end{aligned}

This squared-loss approach has a key advantage over direct KL optimisation: Zϕ(x)Z_\phi(x) can be treated as a learnable parameter, rather than requiring explicit computation of the intractable partition function Z(x)Z(x). Using a stable squared-loss form is also advantageous practically.

However, the actual value of Zϕ(x)Z_\phi(x) has no effect on the expected gradients. Indeed, θLθ,ϕTB(x,y)=θlogπθ(yx)g(x,y;θ)whereg(x,y;θ):=logπθ(yx)+logZϕ(x)βr(x,y).\nabla_\theta \mathcal L^\mathsf{TB}_{\theta, \phi}(x, y) = \nabla_\theta \log \pi_\theta(y \mid x) \cdot g(x, y; \theta) \quad\textsf{where}\quad g(x, y; \theta) := \log \pi_\theta(y \mid x) + \log Z_\phi(x) - \beta r(x, y). But, EYπθ(x)[θlogπθ(Yx)]=0\mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}[\nabla_\theta \log \pi_\theta(Y \mid x)] = 0. Hence, adding anything to gg that is independent of θ\theta does not affect EYπθ(x)[θLθ,ϕTB(x,y)].\mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}[ \nabla_\theta \mathcal L^\mathsf{TB}_{\theta, \phi}(x, y) ].

That doesn't mean that Zϕ(x)Z_\phi(x) should be ignored. Rather, it plays the same role as the baseline in standard RL training. The true partition function minimises the variance.

Learning Zφ: gradients and regression.

Looking back at the per-xx KL objective, define Lθ,ϕKL:=EYπθ(x)[logπθ(Yx)+logZϕ(x)βr(x,Y)];\mathcal L^\mathsf{KL}_{\theta, \phi} := \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}\bigl[ \log \pi_\theta(Y \mid x) + \log Z_\phi(x) - \beta r(x, Y) \bigr]; then, Lθ,ϕKL=LθKL\mathcal L^\mathsf{KL}_{\theta, \phi} = \mathcal L^\mathsf{KL}_\theta if Zϕ=ZZ_\phi = Z.In order to minimise the variance, Zϕ(x)Z_\phi(x) should be such that logZϕ(x)=EYπθ(x)[βr(x,Y)logπθ(Yx)]=EYπθ(x)[log(eβr(x,Y)/πθ(Yx))].\log Z_\phi(x) = \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}[ \beta r(x, Y) - \log \pi_\theta(Y \mid x) ] = \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}\bigl[ \log\bigl( e^{\beta r(x, Y)} / \pi_\theta(Y \mid x) \bigr) \bigr]. Interestingly, this is not the true partition function Z(x)Z(x): indeed, logZ(x)=logEYπθ(x)[eβr(x,Y)/πθ(Yx)].\log Z(x) = \log \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}\bigl[ e^{\beta r(x, Y)} / \pi_\theta(Y \mid x) \bigr]. Calculating either exactly is intractable, so least-squares regression is used: minimise(logZϕ(x)βr(x,y)+logπθ(yx))2=Lθ,ϕTB(x,y)wrtϕ.\textsf{minimise} \quad \bigl( \log Z_\phi(x) - \beta r(x, y) + \log \pi_\theta(y \mid x) \bigr)^2 = \mathcal L^\mathsf{TB}_{\theta, \phi}(x, y) \quad\textsf{wrt}\quad \phi.

Taking derivative wrt ϕ\phi, ϕLθ,ϕTB=2(ϕlogZϕ(x)βr(x,y)+logπθ(yx)).\nabla_\phi \mathcal L^\mathsf{TB}_{\theta, \phi} = 2 \bigl( \nabla_\phi \log Z_\phi(x) - \beta r(x, y) + \log \pi_\theta(y \mid x) \bigr). Thus, the following update is used: sample y1,...,yNiidπθ(x)y_1, ..., y_N \sim^\mathsf{iid} \pi_\theta(\cdot \mid x) and replace logZϕ(x)(1η)logZϕ(x)+η1ni=1N(β(r,yi)logπθ(yix)),\textstyle \log Z_\phi(x) \leftarrow (1 - \eta) \log Z_\phi(x) + \eta \cdot \frac1n \sum_{i=1}^N \bigl( \beta(r, y_i) - \log \pi_\theta(y_i \mid x) \bigr), where η(0,1]\eta \in (0, 1] is the learning rate.

Proposition 5 in Appendix B shows the optimisation problem in Proposition 1 is equivalent, in the same sense, to maxθ{EYπθ(x)[βr(x,y)logZϕ(x)]+H(πθ(x))},\max_\theta \big\{ \mathbb E_{Y \sim \pi_\theta(\cdot \mid x)}\bigl[ \beta r(x, y) - \log Z_\phi(x) \bigr] + \mathcal H\bigl( \pi_\theta(\cdot \mid x) \bigr) \big\}, where H(μ):=EXμ[logμ(X)]\mathcal H(\mu) := - \mathbb E_{X \sim \mu}[\log \mu(X)] denotes the entropy of the distribution. So, the FlowRL objective can be interpreted as jointly maximising reward and entropy. This encourages the policy to cover the entire reward-weighted distribution, rather than collapsing to a few high-reward modes.

FlowRL

Proposition 1 is not new to FlowRL. However, its application to long context is new. Previous GFlowNets works typically operate on short trajectores in small, discrete spaces (no citation given). Long CoT reasoning introduces two challenges.

  1. Exploding gradients. The log-probability term logπθ(yx)\log \pi_\theta(y \mid x) decomposes as tlogπθ(yty<t,x)\sum_t \log \pi_\theta(y_t \mid y_{< t}, x), causing the gradient to potentially scale with sequence length.

  2. Sampling mismatch. Mainstream RL algorithms often perform micro-batch updates and reuse trajectories collected from an old policy, for efficiency reasons. Contrastingly, the KL-based trajectory balance (TB) objective assumes fully on-policy sampling, with responses drawn from the current policy.

Neither of these issues is new to RL post-training of LLMs. Exploding gradients are addressed by length normalisation, rescaling as 1ylogπθ(yx)=logπθ(yx)1/y.\tfrac1{|y|} \log \pi_\theta(y \mid x) = \log \pi_\theta(y \mid x)^{1/|y|}. The sampling mismatch is handled with importance sampling in a relatively standard way, alongside some clipping: w=clip(πθ(yx)/πθold(yx),1ε,1+ε)detach,w = \operatorname{clip}\bigl( \pi_\theta(y \mid x) / \pi_{\theta_\mathsf{old}}(y \mid x), 1 - \varepsilon, 1 + \varepsilon \bigr)_\textsf{detach}, where the subscript-"detatch" indactes that the gradient is detatched during its calculation.

Following established practices in the GFlowNets literature, a reference model is incorporated as a prior constraint on the reward distribution: eβr(x,y)is replaced witheβr(x,y)πref(yx).e^{\beta r(x, y)} \quad\textsf{is replaced with}\quad e^{\beta r(x, y)} \cdot \pi_\textsf{ref}(y \mid x). Naturally, Z(x)Z(x) should now be updated; although, as discussed above, its actual value doesn't affect the gradients, only the variance. This adjustment leads to the following minimisation problem: minθ(logZϕ(x)+logπθ(yx)βr(x,y)logπref(yx)).\min_\theta \bigl( \log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta r(x, y) - \log \pi_\mathsf{ref}(y \mid x) \bigr).

Additionally, as in GRPO, group normalisation is applied to r(x,y)r(x, y): r^i=(rimeanr)/stdr,\hat r_i = (r_i - \operatorname{mean} r) / \operatorname{std} r, where r={r1,...,rG}r = \{r_1, ..., r_G\} denotes a set of rewards within a sampled group.

Substituting all this into Proposition 1 provides the following objective: Lθ,ϕFlowRL(x,y):=w(logZϕ(x)+1ylogπθ(yx)βr^i(x,y)1ylogπref(yx))2.\mathcal L^\mathsf{FlowRL}_{\theta, \phi}(x, y) := w \bigl( \log Z_\phi(x) + \tfrac1{|y|} \log \pi_\theta(y \mid x) - \beta \hat r_i(x, y) - \tfrac1{|y|} \log \pi_\mathsf{ref}(y \mid x) \bigr)^2.

Results

FlowRL is compared with fairly 'vanilla' RL algorithms: REINFORCE++ (2025/01), PPO (2017/07) and GRPO (2024/12). There are modern variants, or tweaks, of GRPO which explicitly address diversity, such as DAPO (2025/03) or ProRL (2025/05). Others, such as GSPO (2025/07) and GMPO (2025/07), address stability. It is highly disappointing that no comparison with these is done—especially as multiple are mentioned in the paper. At least relatively large models are used: 7B and 32B.

Nevertheless, the reported results are included below, for completeness.

Maths and code benchmarks

Diversity