Rethinking Thinking Tokens: LLMs as Improvement Operators


High-Level Summary

Elevator Pitch

Training LLMs to reason incentivises them to output their chain of thought (CoT). This allows exploration of different strategies and backtracking, but comes at a significant computational cost: CoTs can be very long, inflating the context length by an order of magnitude or more. This is expensive in both compute and latency, since generation is serialised, not parallelised.

Parallel-Distill-Refine (PDR) addresses this by providing a bounded workspace in which the "thinking" occurs. The workspace does not persist between rounds, so the context does not grow. PDR generates drafts in parallel, distils them to the workspace and then refines the current answer based on this workspace.

The authors measure accuracy against sequential budget, a proxy for latency. When no parallelisation is used in the first step, they term it Sequential Refine (SR).

Accuracy vs Sequential Budget on AIME 2024

Contributions and Findings:

Methodology: LLMs as Improvement Operators

The primary tool introduced is Parallel-Distill-Refine (PDR).

  1. Generate $M \ge 1$ diverse drafts in parallel. ($M$ has no effect on latency.)
  2. Distil these into a bounded workspace; eg, summarise or pick top-$k$ drafts.
  3. Refine the answer conditional on this workspace, and repeat.

If no parallelisation is used (ie, $M = 1$), PDR is termed Sequential Refine (SR).
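The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `generate` and `distil` are hypothetical stand-ins for the underlying model calls.

```python
def pdr(x, generate, distil, M, R):
    """Parallel-Distill-Refine: M drafts per round, R rounds.

    `generate(x, workspace)` and `distil(x, drafts)` are stand-ins for
    model calls; `distil` must return a *bounded* summary (the workspace).
    """
    workspace = ""  # C^(0) is empty; it never persists across rounds
    for r in range(1, R + 1):
        m = 1 if r == R else M        # final round uses a single call (M_R = 1)
        drafts = [generate(x, workspace) for _ in range(m)]  # parallelisable
        if r == R:
            return drafts[0]          # s_final
        workspace = distil(x, drafts)  # fresh summary; old one is discarded
```

Setting `M=1` recovers Sequential Refine: each round refines a single draft against the previous round's summary.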

PDR overview

Problem Setting and Notation

Given a task $x$, the objective is to produce a high-quality final artefact/solution $s_\textsf{final}$ under a given token budget. The (frozen or trainable) LLM, used as an improvement operator, is denoted $\mathcal M_\theta$.

Given the current artefact $s_t$ and a textual workspace $C_t$, the model refines: $s_{t+1} := \mathcal M_\theta(x, s_t, C_t)$. The workspace $C_t$ is a bounded summary ($|C_t| \le \kappa$) meant to capture agreements, contradictions, intermediate results, goals and the like. The workspace is updated via some distillation operator: $C_{t+1} := \mathcal D(x, s_{t+1})$.

Methods are evaluated under two budgets: the total budget $B_\textsf{tot}$, summed over all model calls (a proxy for cost), and the sequential budget $B_\textsf{seq}$, summed only along the final accepted path (a proxy for latency).

Here, $c = 1, \dots, C$ indexes all model calls, $\text{in}_c/\text{out}_c$ are the input/output tokens for call $c$, and $\mathcal P \subseteq \{1, \dots, C\}$ is the final accepted path.
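Given a per-call token log, the two budgets are straightforward to compute. A sketch under my own framing of the log format, assuming both input and output tokens are counted (the paper's exact accounting should be consulted):

```python
def budgets(calls, accepted_path):
    """calls: list of (in_tokens, out_tokens), one per model call,
    identified by 1-indexed position. accepted_path: set of call ids
    on the final accepted path (the set P).

    Returns (B_seq, B_tot): the sequential budget counts only calls on
    the accepted path; the total budget counts every call, including
    the parallel drafts that were distilled away.
    """
    b_tot = sum(i + o for i, o in calls)
    b_seq = sum(i + o for c, (i, o) in enumerate(calls, start=1)
                if c in accepted_path)
    return b_seq, b_tot
```

Parallel drafts inflate $B_\textsf{tot}$ but not $B_\textsf{seq}$, which is why $M$ has no effect on latency.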

Operator Instantiations

A persistent memory is not maintained. Instead, for rounds $r = 1, \dots, R$, $M_r$ drafts are sampled in parallel conditioned on the current bounded summary, which are then distilled into a fresh bounded summary:

$$\begin{aligned} S^{(r)} &:= \{ s^{(r)}_i := \mathcal M_\theta(x, C^{(r-1)}) \}_{i=1}^{M_r}; \\ C^{(r)} &:= \mathcal D(x, S^{(r)}), \quad C^{(0)} := \varnothing, \quad |C^{(r)}| \le \kappa. \end{aligned}$$

We enforce $M_R = 1$; this final call returns $s_\textsf{final}$. To reiterate, the summary is roundwise and non-persistent.

There are multiple ways to construct the roundwise summary.

  1. Global summary: produce a single, shared $C^{(r)}$ capturing agreements, contradictions, derived facts, unresolved subgoals, next actions and the like.
  2. Top-$k$ evidence: instead of free-form text, select $k$ solutions from $S^{(r)}$ as the workspace itself.
  3. Random-$k$ bootstrapped: construct multiple small workspaces by randomly sampling $k$ solutions per generation; different parallel generations get different workspaces.
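The three constructions can be sketched as follows. The `summarise` and `score` arguments are hypothetical stand-ins (a model call and a ranking heuristic respectively), not functions from the paper:

```python
import random

def global_summary(x, drafts, summarise, cap):
    # One shared workspace: a model-written summary, truncated to the cap.
    return summarise(x, drafts)[:cap]

def top_k(drafts, score, k):
    # Workspace = the k highest-scoring drafts verbatim (no free-form text).
    return sorted(drafts, key=score, reverse=True)[:k]

def random_k_bootstrapped(drafts, k, m, seed=0):
    # m distinct workspaces, each a random k-subset of the drafts; each of
    # the next round's parallel generations conditions on a different one.
    rng = random.Random(seed)
    return [rng.sample(drafts, k) for _ in range(m)]
```

The random-$k$ variant trades consensus for diversity: generations that see different evidence are less likely to collapse onto the same (possibly wrong) answer.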

Operator-Consistent Training

Reasoning models have typically been trained to optimise a single, long CoT trajectory, so using PDR at inference creates a train–test mismatch. This is addressed by training with two modes.

  1. Standard, long-trace optimisation.
  2. Operator rollouts that execute generate → distil → refine under shorter contexts.

The base algorithm is CISPO from MiniMax-M1 (2025/06), a GRPO variant. To improve stability, the authors include an SFT loss:

$$J(\theta) := J_\textsf{CISPO}(\theta) + \alpha J_\textsf{SFT}(\theta),$$

where $\alpha := 0.1$ and

$$J_\textsf{SFT}(\theta) := \mathbb E_{x \sim \mathcal D}\, \mathbb E_{o_1, \dots, o_G \overset{\textsf{iid}}{\sim} \pi_{\theta_\textsf{old}}(\cdot \mid x)} \biggl[ -\frac{1}{|\mathcal I^+|} \sum_{i \in \mathcal I^+} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \log \pi_\theta(o_{i, t} \mid x, o_{i, < t}) \biggr],$$

with $\mathcal I^+$ the indices of the correct rollouts. The CISPO objective ensures the RL explores diverse strategies and learns from both successes and failures; the SFT objective adds stability and strongly reinforces correct solutions, but does not learn from mistakes.
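The SFT term is just the token-averaged negative log-likelihood, averaged over the correct rollouts only. A toy sketch on pre-computed per-token log-probs (my own framing, pure Python):

```python
def sft_loss(rollouts):
    """rollouts: list of (is_correct, token_logprobs) pairs, one per
    sampled output o_i. Averages NLL per token, then over the correct
    set I^+; incorrect rollouts contribute nothing, which is why the
    SFT term reinforces successes but never learns from failures."""
    positives = [lps for ok, lps in rollouts if ok]
    if not positives:
        return 0.0  # empty I^+: no SFT signal for this prompt
    return sum(-sum(lps) / len(lps) for lps in positives) / len(positives)
```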

To address the train–test mismatch, a data mixture is used: at each update, the mini-batch $\mathcal B$ is split evenly into two sub-batches, $\mathcal B_\textsf{tr}$ (standard long traces) and $\mathcal B_\textsf{op}$ (operator rollouts).

The data mixture is designed to preserve competence on long traces whilst also teaching the model to reason across short iterations.

Whilst PDR is only trained with $R = 1$ round, we can run $R > 1$ rounds at inference.

Teach someone to fish ($R = 1$), and they'll be fed for life ($R > 1$).

Experimental Results

The models gpt-o3-mini and gemini-2.5-flash are evaluated on the AIME '24 & '25 benchmarks. Performance is reported as a function of both the sequential budget $B_\textsf{seq}$ (ie, 'latency') and the total budget $B_\textsf{tot}$ (ie, 'cost').

The first figure compares long CoT, PDR & SR at matched sequential budgets on the AIME '24 benchmark.

Long CoT vs PDR vs SR at matched sequential budgets on AIME '24

The difference is starker for GPT-o3 than for Gemini 2.5, but this may simply reflect GPT-o3's lower baseline.

The next figure uses only Gemini 2.5, but also compares total budget.

Long CoT vs PDR vs SR for Gemini 2.5 on different budgets

Alas, the lack of a log scale on the horizontal axis makes the visualisation rather cluttered.

Further experiments are given in the paper, all motivated by four research questions.

  1. Can short-context iterations outperform long traces?
  2. Which distillation strategy works best?
  3. How does a model's verification ability affect performance?
  4. Can operator-consistent training shift the Pareto frontier?

Conclusion

The bounded-workspace approach certainly offers improved latency at high token counts: with a bounded context, sequential compute grows linearly in total tokens rather than quadratically. Its parallelisation also appears to offer performance improvements, albeit potentially at a high total cost.