Let Models Speak Ciphers: Multiagent Debate through Embeddings
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Text Generation Beyond Discrete Token Sampling ("Mixture of Inputs")
Training Large Language Models to Reason in a Continuous Latent Space (COCONUT)
Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
There are two main aspects of this paper: the CIPHER generation mechanism, which communicates full token distributions via embeddings rather than sampled tokens, and the multiagent debate procedure built on top of it.
A standard causal LLM generates tokens autoregressively based on the previous tokens. Given a question $q$ and the first $t-1$ generated response tokens $r_1, \dots, r_{t-1}$, it calculates a vector of logits
$$z_t = \mathrm{LLM}\bigl(E(q) \oplus E(r_1) \oplus \cdots \oplus E(r_{t-1})\bigr),$$
where $E$ is the embedding function and $\oplus$ concatenates. The next token $r_t$ is then sampled from the vocab $\mathcal{V}$ wrt
$$p_t = \mathrm{softmax}(z_t / \tau),$$
where $\tau$ is the temperature. This sampling discards all the information in the probability distribution. CIPHER retains it by plugging back in not the sampled token, but a weighted average of (the embeddings of) all tokens. Formally,
$$e_t = \sum_{v \in \mathcal{V}} p_t(v) \, E(v).$$
To emphasise, the embeddings $E(v)$, $v \in \mathcal{V}$, need only be calculated once. There is no $t$-th token which is to be embedded; rather the $t$-th embedding $e_t$ is calculated directly as a convex combination of the (precalculated) vocab embeddings.
The generation process stops when either of two conditions holds.
If the response length is $T$, the CIPHER response is the embedding sequence $(e_1, \dots, e_T)$.
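To make the generation step concrete, here is a minimal PyTorch sketch of one possible implementation. It assumes a HuggingFace-style causal LM that accepts `inputs_embeds` and exposes its input embedding matrix via `get_input_embeddings()`; the function name `cipher_generate` and the stopping rule (halt when the most likely token is EOS) are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cipher_generate(model, prompt_embeds, max_len=256, temperature=1.0):
    """One possible CIPHER-style generation loop (sketch, not the paper's code).

    model:         HuggingFace-style causal LM accepting `inputs_embeds`
    prompt_embeds: embedded question/instructions, shape (1, prompt_len, d_model)
    Returns the CIPHER response as a sequence of embeddings, shape (1, T, d_model).
    """
    vocab_embeds = model.get_input_embeddings().weight        # E(V): (|V|, d_model), computed once
    eos_id = model.config.eos_token_id
    embeds = prompt_embeds
    response = []
    for _ in range(max_len):
        logits = model(inputs_embeds=embeds).logits[:, -1, :]  # z_t, shape (1, |V|)
        probs = F.softmax(logits / temperature, dim=-1)         # p_t
        e_t = probs.to(vocab_embeds.dtype) @ vocab_embeds       # convex combination of vocab embeddings
        response.append(e_t)
        # Assumed stopping rule: halt once the most likely token is EOS
        # (the paper's exact conditions are not reproduced here).
        if probs.argmax(dim=-1).item() == eos_id:
            break
        embeds = torch.cat([embeds, e_t.unsqueeze(1)], dim=1)   # feed e_t back in; no token is ever sampled
    return torch.stack(response, dim=1)
```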
The CIPHER debate procedure has the following steps.
Convert the question and instructions into embeddings $E(q)$.
For each debate round, form an embedding representation by concatenating $E(q)$ and the (possibly empty) CIPHER responses $e^{(1)}, \dots, e^{(K)}$ from all $K$ debaters in previous rounds.
Feed this embedding representation into the models without any token decoding step. The debaters generate refined CIPHER responses, $e^{(1)}, \dots, e^{(K)}$, following the procedure above.
To close the debate, convert the embedding responses back to natural language using nearest-neighbour search over the vocabulary set, then aggregate.
The full procedure is summarised in Algorithm 1 of the paper.
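The debate loop itself then reduces to repeatedly concatenating embeddings and calling the generation routine. The sketch below reuses the hypothetical `cipher_generate` from above and makes the same assumptions (debaters share a tokenizer/vocabulary and embedding dimension, Euclidean nearest-neighbour decoding); none of the names come from the paper.

```python
import torch

def cipher_debate(models, tokenizer, question, rounds=2):
    """Sketch of the CIPHER debate procedure for K debaters sharing one vocabulary.
    Reuses `cipher_generate` from the sketch above."""
    # 1. Embed the question and instructions, E(q).
    ids = tokenizer(question, return_tensors="pt").input_ids
    q_embeds = models[0].get_input_embeddings()(ids)             # (1, prompt_len, d_model)

    responses = [None] * len(models)
    for _ in range(rounds):
        new_responses = []
        for model in models:
            # 2. Concatenate E(q) with all debaters' previous CIPHER responses (if any).
            parts = [q_embeds] + [r for r in responses if r is not None]
            context = torch.cat(parts, dim=1)
            # 3. Generate a refined CIPHER response, staying entirely in embedding space.
            new_responses.append(cipher_generate(model, context))
        responses = new_responses

    # 4. Decode each final response via nearest-neighbour search over the vocab embeddings.
    vocab_embeds = models[0].get_input_embeddings().weight       # (|V|, d_model)
    decoded = []
    for r in responses:
        token_ids = torch.cdist(r[0].float(), vocab_embeds.float()).argmin(dim=-1)
        decoded.append(tokenizer.decode(token_ids))
    return decoded
```

Because everything is exchanged as raw embeddings, this kind of loop only works between models that share an embedding space (in practice, the same vocabulary), which is one reason the experiments pair models from the same family.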
CIPHER is benchmarked against three baselines.
The third baseline is closest to CIPHER, with the primary difference being the method of communication (sampled token vs distribution).
Most experiments are conducted using LLaMA2-70B, one of the largest open-source models available at the time. The evaluation is across five reasoning benchmarks.
Table 1 uses two debating LLMs from the same family. Table 2 has LLaMA2-70B and LLaMA-65B debate each other; unsurprisingly, the 70B performs worse when paired with the 65B than with another 70B, and the 65B performs better with a 70B partner than with another 65B.
An ablation study is also undertaken, but not described here; see Section 5.3 of the paper for details.