Controlling Thinking Speed in Reasoning Models

High-Level Summary

Elevator Pitch

Your mum has probably heard of Thinking, Fast and Slow (Daniel Kahneman, 2011). So has everyone in LLMs, and this is yet another paper with the same theme:

"How can we combine the advantages of both System 1 and System 2 thinking within one model, thus simultaneously achieving both efficiency and accuracy?"

The authors argue that some LRMs intrinsically possess both slow- and fast-thinking abilities. They observe that slow and fast outputs consistently start with distinct opening words: "okay" or "alright" versus "to" or "first". This provides a built-in switch that can be manually activated. A principal component $v$ aligning with the "slow → fast" direction is found.

A dynamic reasoning speed control method is used: for each hidden layer, the hidden state is updated as $h \leftarrow h + \alpha v$.

Observing the Switch

The authors observe that some LRMs inherently exhibit both fast- and slow-thinking modes. They analyse the leading-word frequencies in top-100 shortest/longest responses by DeepSeek-R1-Distill-Qwen-7B on MATH-500.

Leading-word statistics
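As a sketch, this kind of leading-word tally is only a few lines. The response strings below are made up for illustration, not the paper's actual data:

```python
from collections import Counter

def leading_word_counts(responses):
    """Tally the first word of each response, lowercased, punctuation stripped."""
    return Counter(r.split()[0].lower().strip(",.!") for r in responses if r.split())

# Hypothetical stand-ins for the top-100 shortest/longest response sets.
short = ["To solve this, note the identity ...", "First, we compute ...", "To begin, ..."]
long_ = ["Okay, so the problem asks ...", "Alright, let me think ...", "Okay, let's see ..."]

print(leading_word_counts(short))  # 'to' and 'first' dominate the short set
print(leading_word_counts(long_))  # 'okay' and 'alright' dominate the long set
```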

Given this fairly stark difference, they hypothesise that different leading words steer LRMs to long/short responses. This is investigated by seeding the response with different words - presumably "To" and "Okay", though the paper does not make this explicit.
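A minimal sketch of such seeding, assuming a DeepSeek-R1-style `<think>` block and plain-text prompting (the exact chat template and opener choice are model-specific assumptions, not spelled out in the paper):

```python
def seed_response(question, opener):
    """Prepend a chosen opening word to the reasoning block, so the model
    continues from it. opener="Okay" should elicit slow thinking,
    opener="To" fast thinking, per the leading-word statistics above."""
    return f"{question}\n<think>\n{opener}"

prompt = seed_response("What is 17 * 24?", "Okay")  # slow-thinking seed
# Generation would then continue from the seeded prefix, e.g. (hypothetical):
# ids = tokenizer(prompt, return_tensors="pt").input_ids
# model.generate(ids, max_new_tokens=2048)
```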

Fast/slow thinking statistics

There is a pretty significant drop in both performance and token count in fast vs slow.

Finding, Extracting and Activating the Switch

RepE (uncited) claims that abstract cognitive functions are encoded as linear directions in LLMs' representation space. The authors hypothesise similarly that the distinct reasoning modes are governed by such a direction. Three stages are used to compute the vector.

  1. Designing Stimuli. Given an input $q_i$, the fast/slow response $a_i^{f/s}$ is used as the positive/negative stimulus, denoted $T_i^{+/-} := (q_i, a_i^{f/s})$.

  2. Collecting Hidden Representations. For each layer, each input stimulus is processed, and the hidden state $h$ at the final position of the stimulus is collected.

  3. Constructing the PCA Model. Denote the pairs collected in Step 2 as $\{(h_1^+, h_1^-), (h_2^+, h_2^-), \dots, (h_n^+, h_n^-)\}$. For half the pairs ($i \le n/2$), calculate the difference $d_i^{-\to+} := h_i^+ - h_i^-$; for the other half ($j > n/2$), the reverse $d_j^{+\to-} := h_j^- - h_j^+$. This forms the dataset $\{d_i^{-\to+}\}_{i \le n/2} \cup \{d_j^{+\to-}\}_{j > n/2}$. The first principal component $v$ aligns with "slow → fast": $v^\top d_i^{-\to+} > 0$ and $v^\top d_j^{+\to-} < 0$.
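The three stages can be sketched in NumPy. Here `h_pos`/`h_neg` stand for stacked final-position hidden states from the fast (+) and slow (-) stimuli at one layer; the closing sign-fix handles PCA's inherent sign ambiguity, which the write-up glosses over:

```python
import numpy as np

def fast_slow_direction(h_pos, h_neg):
    """First principal component of the signed hidden-state differences.

    h_pos, h_neg: (n, d) arrays of hidden states from the positive (fast)
    and negative (slow) stimuli. Sign conventions follow the steps above.
    """
    n = len(h_pos)
    diffs = np.vstack([
        h_pos[: n // 2] - h_neg[: n // 2],   # d_i^{- -> +} for i <= n/2
        h_neg[n // 2 :] - h_pos[n // 2 :],   # d_j^{+ -> -} for j >  n/2
    ])
    diffs = diffs - diffs.mean(axis=0)        # centre before PCA
    # First right-singular vector of the centred data = first principal component.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]
    # Resolve the arbitrary PCA sign so v points along "slow -> fast",
    # i.e. v . d^{- -> +} > 0 on (the median of) the first half.
    if np.median((h_pos[: n // 2] - h_neg[: n // 2]) @ v) < 0:
        v = -v
    return v
```

On synthetic data where the fast and slow states differ along a known axis, the recovered $v$ lines up with that axis.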

The vector $v$ is used as the steering vector. Specifically, at a given layer, the hidden state $h$ is modified with intensity $\alpha$: $h \leftarrow h + \alpha v$.
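A minimal sketch of that steering step as a PyTorch forward hook, assuming a HuggingFace-style decoder layer whose output tuple carries hidden states first (layer index and $\alpha$ sign convention as above; both are assumptions about the implementation):

```python
import torch

def make_steering_hook(v, alpha):
    """Return a forward hook that adds alpha * v to a layer's hidden states.

    Per the direction of v above, alpha > 0 pushes towards fast thinking
    and alpha < 0 towards slow thinking.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Hypothetical usage on a loaded HF model, steering layer k:
# handle = model.model.layers[k].register_forward_hook(make_steering_hook(v, alpha=2.0))
# ... generate ...
# handle.remove()
```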

Experimental Results and Analysis

A collection of LRMs are benchmarked with differing values of $\alpha$.

Summary of performance vs response length

Qwen3-8B performance vs response length

Adaptive Control of Thinking Speed

The previous experiments fix $\alpha \in \mathbb{R}$ throughout the run. By contrast, humans dedicate more/less computational energy to (perceived) difficult/easy parts. The goal of this section is to estimate the difficulty, and dynamically adjust $\alpha$ accordingly.

The details are rather technical, not particularly clearly explained and, I feel, probably not super important, so the reader is directed to §4 in the paper. Just the (apparently impressive) results are reported here, with a small amount of commentary. Throughout, $\alpha$ is restricted to $[-4, 4]$.

Table of adaptive results

Conclusion

The paper makes a strong case for activating already-existing fast-/slow-thinking transitions. The training-free approach is much more natural than some of the somewhat ad-hoc soft thinking, mixture of inputs or similar approaches.

That said, the choice of steering vector is slightly ad-hoc, taking a selection of observed words. Perhaps this could be improved by trying to learn it? It is model-specific, though, which is a good start.

A clearer analysis of the adaptive case would be nice, with some kind of real-time estimate of the "thinking speed" - eg, an 'effective $\alpha$'.

Overall, it is a decent paper, making a much more concrete case for "System 1 vs System 2" than many others that I've read.

PS: §8 Societal impact discussion

"We believe our work would inspire future research on the development of more efficient and intelligent AI systems. We foresee no negative societal impacts from our research[.]"

Well, that's good to know. Here I was, thinking there could be pros and cons. No, "no negative societal impacts".