LLMs Are Single-Threaded Reasoners: Demystifying the Working Mechanisms of Soft Thinking



High-Level Summary

Elevator Pitch

Latent reasoning has taken off recently - "reason over concepts not language", "explore multiple paths simultaneously", "reflects human thinking", etc. Performance gains are often reported but, for the training-free approaches in particular, the improvements are typically minimal. Moreover, they are often not robust, helping on some benchmarks and hindering on others - with no discernible (or, at least, discerned) reason.

The current work dives deep into Soft Thinking:

They provide evidence against the claims of increased diversity and parallel reasoning; their evidence suggests LLMs are single-threaded reasoners, and that soft thinking actually makes sampling more greedy, with paths stemming from non-top tokens typically terminating within a couple of steps.

Illustration of Soft Thinking probabilities

The authors suggest reintroducing randomness via the Gumbel softmax trick - actually, a pretty smart application of it.

Background

The Soft Thinking framework is briefly and informally described here; see my summary or the original paper for more details.

Given an input $x$ and generated tokens $y_{<t}$, the LLM predicts the next token $y_t$:

$$y_t \sim \mathsf{LLM}(x, y_{<t}) \in \Delta^{|V| - 1},$$

where $V$ is the vocabulary and $\Delta^{|V|-1}$ the probability simplex. The (embedding of the) soft token is defined as the expectation

$$\mathsf{st} := \sum_{i=1}^{|V|} e_i \, \mathbb{P}(y_t = t_i \mid x, y_{<t}),$$

where $e_i$ is the embedding of token $t_i$, the $i$-th in the vocabulary $V$. Typically, a top-$k$ or top-$p$ truncation (with renormalisation) is applied first.

The paper conflates the distribution and expectation frequently, but without causing significant confusion.
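To make the construction concrete, here is a minimal NumPy sketch of the top-$p$ soft-token computation. The function name, array shapes and toy numbers are my own choices, not the paper's; it is a sketch of the definition above, not the authors' implementation.

```python
import numpy as np

def soft_token(probs: np.ndarray, emb: np.ndarray, top_p: float = 0.95) -> np.ndarray:
    """Top-p truncate and renormalise `probs`, then return the expected embedding.

    probs: next-token distribution over the vocabulary, shape (|V|,).
    emb:   token-embedding matrix, shape (|V|, d).
    """
    order = np.argsort(probs)[::-1]                   # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]   # smallest prefix with mass >= top_p
    trunc = np.zeros_like(probs)
    trunc[keep] = probs[keep]
    trunc /= trunc.sum()                              # renormalise over the kept tokens
    return trunc @ emb                                # st = sum_i e_i * P(y_t = t_i)

# Toy example: 4 tokens with 2-d embeddings; only the top two survive top_p = 0.95.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
probs = np.array([0.56, 0.40, 0.03, 0.01])
st = soft_token(probs, emb)
```

With these numbers the third and fourth tokens are discarded, and `st` is the renormalised 0.56/0.40 blend of the first two embeddings.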

Investigation

Experimental Results

Soft Thinking is evaluated across seven benchmarks and three 32B models (DeepSeek-R1-Distill-Qwen-32B, QwQ-32B and Skywork-OR1-32B), with generation length capped at 32,768 tokens. The authors find that, almost universally, Soft Thinking performs worse than traditional sampling—and even performs worse than greedy 'sampling', in which the top token is chosen deterministically.

Table of results comparing Soft Thinking with traditional and greedy sampling

Case Study

There is certainly some evidence that Soft Thinking actually behaves in a pretty greedy manner: instead of spreading out the distribution, it concentrates mass on the high-probability paths. This runs counter to the narrative pushed by those advocating such approaches. The following figure shows the top-$p$ ($p = 0.95$, I believe) tokens at each step. Invariably, they exhibit semantic coherence only with the dominant token from the previous step:

This leads the authors to their central hypothesis that LLMs are single-threaded reasoners.

They evidence this further with analysis of the Jensen–Shannon (JS) divergence between probability distributions; see their §4.2 for more details.

Output entropy/token probability vs JS-divergence
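For reference, the JS divergence they measure is the symmetrised, bounded variant of relative entropy. A small sketch (the `eps` smoothing is my own numerical guard, not from the paper):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """JS(p, q) = 0.5*KL(p || m) + 0.5*KL(q || m) with m = (p + q)/2, in nats."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give 0; disjoint ones attain the maximum, log 2.
sharp = np.array([1.0, 0.0])
flat = np.array([0.0, 1.0])
```

Unlike raw KL, this is symmetric and bounded by $\log 2$, which makes it convenient for comparing distributions across decoding steps.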

The authors refer to this greedy-like behaviour as the greedy pitfall.

Mixture of Inputs

A related method is Mixture of Inputs (MoI). This randomises the input by sampling a token and taking a mixture of the one-hot encoding of the sampled token and the soft token (the expectation).

MoI isn't referenced in the current paper. However, it too only makes the probability distribution more concentrated around its top token, so it seems likely that MoI suffers from similar issues to vanilla Soft Thinking.
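As a rough sketch of the idea only - not the actual MoI algorithm; in particular, the fixed mixing weight `beta` is my simplification, and MoI derives its weights differently:

```python
import numpy as np

def mixture_of_inputs(probs, emb, rng, beta=0.5):
    """Blend the embedding of a *sampled* token with the soft token (expectation).

    beta weights the one-hot (sampled) part; 1 - beta weights the expectation.
    """
    sampled = rng.choice(len(probs), p=probs)   # ordinary ancestral sample
    return beta * emb[sampled] + (1.0 - beta) * (probs @ emb)

rng = np.random.default_rng(0)
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = np.array([1.0, 0.0])                    # degenerate case: must sample token 0
x = mixture_of_inputs(probs, emb, rng)
```

Note that the expectation term still pulls every input towards the mean embedding, which is why the concern about concentration around the top token carries over.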

Addressing the Greedy Pitfall

The key aspect is to use randomised soft thinking. The authors consider Dirichlet Sampling and the Gumbel Softmax trick; for simplicity, we focus on the latter here. The key idea is to add noise to the logits $(\ell_i)_{i \in V}$:

instead of sampling according to $\pi_i \propto e^{\ell_i/\tau}$, use $\tilde\pi_i \propto e^{(g_i + \ell_i)/\tau}$, where $g_1, \dots, g_{|V|} \sim^{\mathsf{iid}} \mathrm{Gumbel}(0,1)$, the unit Gumbel distribution.

The reason for using the Gumbel distribution is the following invariance. Let $I$/$\tilde I$ denote the usual/Gumbel-softmax samples; ie, $\mathbb{P}(I = i) = \pi_i$ and $\mathbb{P}(\tilde I = i \mid g_1, \dots, g_{|V|}) = \tilde\pi_i$, which is a random variable. Then, averaging over the Gumbel noise,

$$\mathbb{P}(\tilde I = i) = \mathbb{E}[\mathbb{P}(\tilde I = i \mid g_1, \dots, g_{|V|})] = \mathbb{E}[\tilde\pi_i] = \pi_i = \mathbb{P}(I = i).$$

[Author note: I haven't checked this myself, but presumably it isn't difficult. The result is cited in this paper, and dates back to 2017.]

So, from a sampling point of view, nothing has changed. But, from a soft thinking point of view, we now have a new, randomised probability distribution.
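A NumPy sketch of the randomised distribution, together with an empirical check of the closely related (and exact) Gumbel-max identity: the argmax of Gumbel-perturbed logits is distributed according to the softmax of the original logits. Function names and the toy logits are mine.

```python
import numpy as np

def randomised_distribution(logits: np.ndarray, rng, tau: float = 1.0) -> np.ndarray:
    """Soft-thinking distribution built from Gumbel-perturbed logits."""
    g = rng.gumbel(size=logits.shape)     # unit Gumbel noise, one draw per token
    z = (logits + g) / tau
    z = z - z.max()                       # stabilise the softmax
    p = np.exp(z)
    return p / p.sum()

# Empirical check of the Gumbel-max identity: the *argmax* of the noisy logits
# is distributed according to softmax(logits), so the noise does not bias
# which token wins on average.
rng = np.random.default_rng(0)
logits = np.log(np.array([0.56, 0.40, 0.03, 0.01]))
n = 100_000
winners = np.argmax(logits + rng.gumbel(size=(n, 4)), axis=1)
freq = np.bincount(winners, minlength=4) / n
```

Each call to `randomised_distribution` yields a fresh, valid distribution over the vocabulary, from which a fresh soft token can then be formed.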

Probability-weighted paths in soft thinking

This addresses the greedy collapse. Consider Figure 1, in which both "so" and "let" receive significant probability mass: 56% and 44% respectively. The higher-weighted "so" dominates in that figure; but once the logits receive Gumbel noise, it is reasonably likely that the noisy weight of "let" ends up higher, in which case the "let" branch is explored instead. This enables exploration of many more paths.
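To put numbers on that: by the Gumbel-max identity, with $\mathbb{P}(\text{"so"}) = 0.56$ and $\mathbb{P}(\text{"let"}) = 0.44$, the noise makes "let" the top token in 44% of draws. A quick simulation (toy setup, mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
logits = np.log(np.array([0.56, 0.44]))          # "so", "let"
noisy = logits + rng.gumbel(size=(100_000, 2))   # fresh Gumbel noise per rollout
frac_let = np.mean(np.argmax(noisy, axis=1) == 1)
```

So, close to half the rollouts now start from the "let" branch, rather than essentially none under vanilla Soft Thinking.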

The table below demonstrates improved performance, particularly for the Gumbel variant, of randomised Soft Thinking over both vanilla Soft Thinking and vanilla sampling.

Table of results for randomised soft thinking

The paper isn't definitive on this, but it seems that the same temperature is used across all nine benchmarks; ie, the temperature isn't fitted to the specific benchmark.

The authors also discuss balancing softness and randomness. The objectives of this section are less well defined to me, and I'm not really sure what they're trying to get at. They conclude that Gumbel balances the two better than Dirichlet, though; it also performs better in the table above.

Conclusion

The paper makes a strong case for issues surrounding vanilla soft thinking. Their Gumbel-softmax randomisation strategy has much better theoretical grounding than, for example, that used in Mixture of Inputs.

After all the discussion and examples around how vanilla soft thinking suffers collapse issues, no comparable investigation is given of their randomised versions. Improved performance is demonstrated, suggesting some issues have been resolved, but a proper investigation is unfortunately missing.

A cynic would suggest that such an investigation is so obviously needed that its lack of inclusion suggests the results did not align with the authors' expectations or desires.