Reasoning by Superposition Summary
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
- 2025-05; Zhu, Hao, Hu, Jiao, Russell, Tian
High-Level Summary
- Theoretical analysis of continuous chain of thought
- Concretely, continuous CoT solves graph reachability in d steps, where d is the graph diameter, rather than the O(n^2) steps (n being the number of vertices) known for discrete CoT
- The continuous thought vector is a superposition that encodes multiple search frontiers simultaneously, in effect a 'parallel BFS'
- Notably, this superposition behaviour emerged naturally during training, without explicit supervision
- Construction works for widely-used position encodings, not problem-specific ones
Some Details
Graph Reachability Problem
- Input:
  - directed graph G=(V,E), where V={v1,...,vn} is the vocab
  - root node r and two candidate destination nodes c1 and c2
- Objective:
  - determine which of c1 and c2 is reachable from r
  - it is given that precisely one is reachable
- Each edge e=(s,t)∈E⊆V×V is of the form (source,target)
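For reference, the problem itself can be solved classically with a frontier-based BFS. The sketch below is not the paper's transformer construction, just a plain-Python baseline under the stated input format; the function name `reachable_candidate` and the string vertex labels are hypothetical.

```python
from collections import defaultdict

def reachable_candidate(edges, root, c1, c2):
    """Return whichever of c1, c2 is reachable from root (exactly one is assumed to be)."""
    successors = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)

    visited = {root}
    frontier = {root}
    while frontier:
        if c1 in visited:
            return c1
        if c2 in visited:
            return c2
        # Expand every frontier vertex by one edge at once.
        frontier = {t for s in frontier for t in successors[s]} - visited
        visited |= frontier
    raise ValueError("neither candidate is reachable from the root")

edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(reachable_candidate(edges, "a", "c", "e"))  # c
```

Each pass of the loop expands the whole frontier by one edge, so the number of passes is bounded by the longest shortest path from the root, which is the quantity the d-step claim refers to.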
Attention Chooser (simplified). If the current token is ⟨x⟩, then there is a construction of key-/query-matrices such that almost all attention is paid to position i−ℓ, where i is the current position and ℓ a pre-defined lag.
The core idea is to design the query and key vectors such that their inner product, which determines the attention score, is maximised at the desired position.
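A toy NumPy sketch of this idea follows. It assumes one-hot position codes, which is a simplification for illustration and not the position encoding used in the paper, and it omits the condition on the current token being ⟨x⟩. The names `attention_weights` and `sharpness` are hypothetical.

```python
import numpy as np

def attention_weights(seq_len, current, lag, sharpness=20.0):
    """Softmax attention from position `current` over positions 0..current."""
    pos = np.eye(seq_len)                          # assumed one-hot position codes
    # Query matrix: maps the code of position i to a scaled code of position i - lag.
    w_query = sharpness * np.eye(seq_len, k=-lag)
    w_key = np.eye(seq_len)                        # keys are the position codes themselves
    query = pos[current] @ w_query
    keys = pos[: current + 1] @ w_key              # causal: only positions up to `current`
    scores = keys @ query                          # inner product peaks at current - lag
    scores -= scores.max()                         # numerically stable softmax
    weights = np.exp(scores)
    return weights / weights.sum()

w = attention_weights(seq_len=8, current=6, lag=2)
print(int(w.argmax()))  # 4
```

Raising `sharpness` pushes the softmax closer to a hard selection of position i−ℓ, which is the sense in which almost all attention lands there.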
Superposition Construction (simplified). There is a choice of parameters such that the c-th continuous thought corresponds to the normalised, uniform superposition of (the embeddings of) all vertices reachable from the root r within c steps.
The proof constructs a two-layer transformer:
- the first layer copies the source and target node embeddings of each edge onto that edge's token;
- the second layer performs a one-step expansion of the currently explored vertices.
An MLP layer then filters out noise and equalises the weights of the vertices remaining in the superposition.
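The expansion and filtering steps above can be mimicked numerically. The sketch below assumes one-hot (hence orthonormal) vertex embeddings and replaces the attention layers with an adjacency-matrix product and the MLP with a hard threshold; these are illustrative stand-ins, not the paper's parameters, and `next_thought` is a hypothetical name.

```python
import numpy as np

def next_thought(thought, edges, n, threshold=1e-6):
    """One continuous-thought step: expand along edges, filter, re-normalise."""
    adjacency = np.zeros((n, n))
    for source, target in edges:
        adjacency[source, target] = 1.0
    # Keep already-explored vertices and add their one-step successors.
    expanded = thought + thought @ adjacency
    # Stand-in for the MLP: drop near-zero noise and give survivors equal weight.
    mask = (expanded > threshold).astype(float)
    return mask / np.linalg.norm(mask)

n = 5
edges = [(0, 1), (0, 2), (1, 3)]   # vertex 4 is unreachable from root 0
thought = np.eye(n)[0]             # start from the root alone
for step in range(1, 3):
    thought = next_thought(thought, edges, n)
    print(step, np.flatnonzero(thought).tolist())
# 1 [0, 1, 2]
# 2 [0, 1, 2, 3]
```

After step c the nonzero coordinates are exactly the vertices within c steps of the root, all with equal weight, matching the uniform superposition in the statement.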
Empirical Validation
The theoretical claims are supported by experiments. The figure below compares the accuracy of Coconut with two layers (blue, 98%) against vanilla CoT (brown, 75%), CoT* with 12 layers (green, 83%) and no CoT (pink, 75%).
