ProRL Summary

High-Level Summary

Elevator Pitch

Recent advancements in reasoning have focused on RL-based fine-tuning. A fundamental question remains under debate:

Does RL truly unlock new reasoning capabilities, or does it merely optimise sampling of solutions already learned?

Several recent studies argue the latter, basing their conclusions on pass@$k$ metrics with large $k$. The authors of this paper argue the former, positing that the previous conclusions may stem from methodological constraints, not fundamental limitations of RL:

  1. Overreliance on specialised domains in which the models are often overtrained during both pre- and post-training, restricting potential for RL exploration.
  2. Premature termination of RL training prior to full exploration and development of new reasoning capabilities.

ProRL overview

Methodology

The RL training is based on the GRPO algorithm, with a few tweaks to address entropy collapse. Entropy collapse, where the model's output becomes too concentrated early, causes the policy to commit to a narrow set of outputs prematurely, limiting exploration. This is particularly bad in GRPO, where the learning signal relies on having a diverse set of sampled outputs.
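A simple way to see entropy collapse in practice is to track the mean per-token entropy of the policy's output distribution during training. The sketch below is illustrative, not the paper's implementation; the function name and shapes are assumptions.

```python
import numpy as np

def mean_token_entropy(logits):
    """Mean per-token entropy of the policy's output distribution.

    A rapid decline of this value towards zero during RL training is the
    entropy collapse described above: the policy commits to a narrow set
    of outputs and exploration dries up.
    logits: array of shape (batch, seq_len, vocab_size).
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)  # (batch, seq_len)
    return float(entropy.mean())

# A uniform distribution attains the maximal entropy log(vocab_size);
# a collapsed (near one-hot) policy attains entropy close to zero.
uniform = np.zeros((2, 5, 100))
print(mean_token_entropy(uniform))  # ≈ log(100) ≈ 4.605
```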

  1. Decoupled Clipping.

    In the original GRPO, the clipping is symmetric. Here, the upper and lower thresholds are decoupled: $\operatorname{clip}(r_\theta, 1 - \varepsilon_\textsf{low}, 1 + \varepsilon_\textsf{high})$, where $r_\theta$ is the policy ratio. Increasing $\varepsilon_\textsf{high}$ encourages the model to uplift the probabilities of unlikely tokens which, nevertheless, provided significant advantage. They take $\varepsilon_\textsf{low} = 0.2$ (the default for $\varepsilon$ in many libraries) but $\varepsilon_\textsf{high} = 0.4$. This is pretty high: Magistral takes $\varepsilon_\textsf{high} \approx 0.25$, for comparison.

  2. Token-Level Policy Gradient Loss.

    GRPO employs a sample-level loss calculation: first, average the losses over tokens within each sample; then, aggregate the losses across samples. Notationally, $G^{-1} \sum_{i=1}^G |o_i|^{-1} \sum_{t=1}^{|o_i|} (\dots)$. Each sample is assigned equal weight, and so tokens in longer responses are underweighted. DAPO uses a token-level calculation: $\bigl(\sum_{i=1}^G |o_i|\bigr)^{-1} \sum_{i=1}^G \sum_{t=1}^{|o_i|} (\dots)$.

Both of these modifications are taken from DAPO.
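The two tweaks above can be sketched together in a few lines. This is a minimal illustration assuming flat per-token arrays of ratios and advantages; the function name and shapes are not from the paper.

```python
import numpy as np

def dapo_style_loss(ratios, advantages, eps_low=0.2, eps_high=0.4):
    """Token-level clipped surrogate with decoupled clipping (a sketch).

    ratios: per-token policy ratios r_theta, flattened over all G samples.
    advantages: per-token advantages, same shape.
    The single global mean over all tokens is the token-level aggregation:
    unlike GRPO's per-sample averaging, every token carries equal weight,
    so tokens in long responses are no longer underweighted.
    """
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high)
    per_token = np.minimum(ratios * advantages, clipped * advantages)
    return -per_token.mean()

# With a positive advantage, ratios above 1 + eps_high stop contributing
# extra gradient; the asymmetric eps_high = 0.4 raises that ceiling.
ratios = np.array([0.5, 1.0, 1.3, 2.0])
advantages = np.ones(4)
print(dapo_style_loss(ratios, advantages))
```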

  1. KL Regularisation and Reference-Policy Reset.

    The original GRPO algorithm does include a KL penalty term $-\beta \operatorname{KL}(\pi_\theta \mathrel{\|} \pi_\textsf{ref})$, where $\pi_\textsf{ref}$ is a fixed reference policy, typically the pre-RL model. Several recent papers, including Magistral and DAPO, removed this term, arguing that the model naturally diverges from the reference policy anyway, so, as training progresses, the KL penalty may come to dominate the loss.

    ProRL keeps the penalty term. To prevent it from dominating, the reference policy $\pi_\textsf{ref}$ is periodically hard-reset to a more recent snapshot of the training policy. The hope is that this allows continued improvement whilst maintaining the benefits of KL regularisation.

  2. Prolonged RL. Typical RL training runs last no more than a few hundred steps. In contrast, ProRL demonstrated continued improvement beyond 2000 training steps; see Figure 1 (left) above. This, the authors hypothesise, gave the algorithm sufficient time to explore and uncover new strategies.
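A toy sketch of KL regularisation with periodic reference resets; the distributions, update rule, interval and $\beta$ below are all illustrative, not the paper's hyperparameters.

```python
import copy
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def run(policy, update, num_steps, reset_interval, beta=0.001):
    """Track the KL penalty as the policy drifts; hard-reset the reference
    to the current policy every reset_interval steps, which snaps the
    penalty back towards zero and lets training keep moving."""
    ref = copy.deepcopy(policy)
    penalties = []
    for step in range(1, num_steps + 1):
        policy = update(policy)
        penalties.append(beta * kl(policy, ref))
        if step % reset_interval == 0:
            ref = copy.deepcopy(policy)  # the hard reset
    return penalties

# Toy update that steadily sharpens the policy towards its first token.
def sharpen(p):
    q = [p[0] * 1.5] + list(p[1:])
    s = sum(q)
    return [x / s for x in q]

pen = run([0.5, 0.5], sharpen, num_steps=10, reset_interval=5)
# The penalty grows for five steps, then drops after the reset at step 5.
print(pen[4] > pen[0], pen[5] < pen[4])  # True True
```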

Results: Nemotron-Research-Reasoning-Qwen-1.5B

The result is, in their words, "the world's best 1.5B reasoning model", and that is perhaps justified. In short, Nemotron-Research-Reasoning-Qwen-1.5B outperforms its baseline, DeepSeek-R1-Distill-Qwen-1.5B, by approximately 15% on maths and code, 25% on STEM reasoning, 22% on instruction following and over 50% on text-based logic puzzles from Reasoning Gym.

It is comparable with, and sometimes outperforms, the much larger DeepSeek-R1-Distill-Qwen-7B.

It also outperforms the domain-specialised baselines of DeepScaleR-1.5B and DeepCoder-1.5B on maths and code, respectively, by around 5%.

Nemotron evaluation

The subcategories in the "Reasoning [Gym]" benchmark are detailed in Table 5 in §F.1. The final three benchmarks in Table 3 (i.e., acre, boxnet and game) are out-of-distribution tasks: these were not included in the RL training data.

The training set-up used verl, with $\varepsilon_\textsf{low} = 0.2$ and $\varepsilon_\textsf{high} = 0.4$ (as mentioned above), alongside dynamic sampling to filter out questions that are too easy or too hard. For each question, 16 responses were sampled, with a (high) sampling temperature of 1.2 and response length capped at 8k tokens "to maintain concise and stable generations". Evaluation used vllm with a sampling temperature of 0.6 and a maximum response length of 32k tokens.
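With binary correctness rewards, dynamic sampling reduces to a simple filter. A minimal sketch (names illustrative, not verl's API):

```python
def keep_prompt(group_rewards):
    """Drop a prompt whose sampled responses all score the same (all
    correct or all wrong): the group-relative advantage is then zero for
    every token, so the prompt contributes no learning signal."""
    return len(set(group_rewards)) > 1

# Rewards for 4 sampled responses per prompt (binary correctness).
batch = {
    "too_easy": [1, 1, 1, 1],
    "too_hard": [0, 0, 0, 0],
    "informative": [1, 0, 0, 1],
}
kept = [prompt for prompt, rewards in batch.items() if keep_prompt(rewards)]
print(kept)  # ['informative']
```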

A context window limit of 8k was used throughout most of the training, until the final stage (~200 steps) in which it was increased to 16k tokens. It was observed that the model adapted quickly, with a noticeable increase in response length.

Training dynamics

The dataset consisted of 136k examples. On four 8×H100-80GB nodes, the whole training took approximately 16k GPU hours. The training dataset and recipe are detailed in §D and §E, respectively.
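Those figures imply roughly three weeks of wall-clock time:

```python
nodes, gpus_per_node = 4, 8
gpu_hours = 16_000

wall_clock_hours = gpu_hours / (nodes * gpus_per_node)  # 32 GPUs in total
print(wall_clock_hours)       # 500.0
print(wall_clock_hours / 24)  # ≈ 20.8 days
```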

Analysis: Does ProRL Elicit New Reasoning Patterns?

The above evaluation is standard benchmarking; it doesn't address the question of "enhanced capabilities vs improved sampling". To address this, pass@$k$ is plotted as a function of $k$, from 1 to 256. Additionally, the final Nemotron model (green) is compared with an intermediate checkpoint (orange) as well as the base model (blue).
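For reference, pass@$k$ is typically computed with the standard unbiased estimator (Chen et al., 2021) from $n \ge k$ samples per problem, of which $c$ are correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k responses,
    drawn without replacement from n samples with c correct, is correct.
    pass@k = 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# 8 correct out of 256 samples: rare at k = 1, near-certain at large k.
print(pass_at_k(256, 8, 1))    # 0.03125
print(pass_at_k(256, 8, 256))  # 1.0
```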

pass@k evaluations

Many more examples are plotted in §F.2.

Generally, it was found that the RL-derived improvement was significantly negatively correlated with the baseline performance: little, or even negative, improvement was seen on tasks where the base model already achieved a high pass@128, but major improvements where the baseline was poor (in some cases from ~0% to ~100%).

Relative improvement

The tasks highlighted in the circle tend to have a low creativity index, indicating a higher overlap with pre-training data: the model has already seen a lot of similar data during pre-training.

The last aspect we discuss is out-of-distribution reasoning, which ProRL appears to enhance.