CIPHER Summary

Let Models Speak Ciphers: Multiagent Debate through Embeddings

High-Level Ideas

There are two main aspects to this paper: communicating via mixtures of token embeddings rather than sampled tokens, and a multiagent debate procedure built on top of this. Both are detailed below.

Some Details

Mixture of Tokens

A standard causal LLM generates tokens autoregressively based on the previous tokens. Given a $\texttt{prompt}$ and the first $t-1$ generated response tokens $\texttt{res}^{1:t-1}$, it calculates a vector of logits $\mathop{\textsf{logit}}\bigl(\mathfrak e(\texttt{prompt}) \circ \mathfrak e(\texttt{res}^{1:t-1})\bigr)$, where $\mathfrak e$ is the embedding function and $\circ$ denotes concatenation. The next token is then sampled from the vocabulary with respect to

$$p^t = [p^t_1, \dots, p^t_V] = \mathop{\textsf{softmax}}\bigl( \mathop{\textsf{logit}} \bigl( \mathfrak e(\texttt{prompt}) \circ \mathfrak e(\texttt{res}^{1:t-1}) \bigr) / T \bigr),$$

where $T$ is the temperature. This sampling discards all the information in the probability distribution. CIPHER retains it by feeding back in not the sampled token but a weighted average of (the embeddings of) all tokens. Formally,

$$\mathfrak e^t = \sum_{i=1}^V p^t_i \, \mathfrak e(v_i) \quad\textsf{where}\quad [p^t_1, \dots, p^t_V] = \mathop{\textsf{softmax}}\bigl( \mathop{\textsf{logit}}\bigl( \mathfrak e(\texttt{prompt}) \circ \mathfrak e^{1:t-1} \bigr) / T \bigr).$$

To emphasise, the embeddings $\mathfrak e(v_i)$ need only be calculated once. There is no $t$-th token to be embedded; rather, the $t$-th embedding $\mathfrak e^t$ is calculated directly as a convex combination of the (precalculated) vocabulary embeddings.
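In code, this step is a one-liner on top of the usual logits computation. Here is a minimal PyTorch sketch, where `model` maps a sequence of embeddings to next-token logits and `vocab_embs` is the precomputed $V \times d$ matrix of embeddings $\mathfrak e(v_i)$; both names are illustrative assumptions, not from the paper.

```python
import torch

def cipher_next_embedding(model, embs: torch.Tensor,
                          vocab_embs: torch.Tensor,  # (V, d), precalculated once
                          temperature: float) -> torch.Tensor:
    """One CIPHER step: a convex combination of the vocab embeddings."""
    logits = model(embs)                             # (V,) next-token logits
    probs = torch.softmax(logits / temperature, dim=-1)
    # Standard decoding would sample here, e.g. torch.multinomial(probs, 1),
    # discarding the distribution; CIPHER keeps all of it:
    return probs @ vocab_embs                        # (d,) = sum_i p_i * e(v_i)
```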

The generation process stops when either of the following two conditions holds.

  1. The end-of-sequence (EOS) token embedding becomes the nearest neighbour to the newly generated embedding.
  2. The maximal sequence length is reached.

If the response length is $\tau$, the CIPHER response is $\texttt{cipher} = \mathfrak e^{1:\tau}$.
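Continuing the sketch above, the full generation loop with both stopping conditions might look as follows; `eos_id` indexes the EOS token in the vocabulary, and this is again an illustrative sketch rather than the paper's code.

```python
def cipher_generate(model, prompt_embs: torch.Tensor,
                    vocab_embs: torch.Tensor, eos_id: int,
                    temperature: float, max_len: int) -> torch.Tensor:
    response = []                                    # generated embeddings e^1, e^2, ...
    embs = prompt_embs                               # (n, d) prompt embeddings
    for _ in range(max_len):                         # condition 2: length cap
        e_t = cipher_next_embedding(model, embs, vocab_embs, temperature)
        # Condition 1: stop once EOS is the nearest vocab embedding to e^t.
        if torch.cdist(e_t[None, :], vocab_embs).argmin().item() == eos_id:
            break
        response.append(e_t)
        embs = torch.cat([embs, e_t[None, :]], dim=0)  # feed e^t back in
    return torch.stack(response)                     # cipher = e^{1:tau}
```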

Debate

The CIPHER debate procedure has the following steps.

  1. Convert the question and instructions into embeddings $\textsf{emb}_\texttt{prompt}$.

  2. For each debate round, form an embedding representation by concatenating $\textsf{emb}_\texttt{prompt}$ with the CIPHER responses $\texttt{cipher}_i$, if any, from all debaters in previous rounds.

  3. Feed this embedding representation into the models without the token-decoding step. The debaters generate refined CIPHER responses $\texttt{cipher}_i$, following the procedure above.

  4. To close the debate, convert the embedding responses back to natural language using nearest-neighbour search over the vocabulary set, then aggregate.

This is visualised in the paper, in Algorithm 1.

[Figure: CIPHER debate algorithm]
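A sketch of the debate loop on top of `cipher_generate` follows. The helpers `decode_tokens` (token ids to text) and `aggregate` (combining the final answers) are assumed interfaces, and only the most recent round's responses are kept in the context, a simplification of step 2.

```python
def cipher_debate(models, prompt_embs, vocab_embs, eos_id,
                  temperature, max_len, num_rounds,
                  decode_tokens, aggregate):
    responses = []                                   # cipher_i from the last round
    for _ in range(num_rounds):
        # Step 2: concatenate prompt embeddings and previous responses.
        context = torch.cat([prompt_embs, *responses], dim=0)
        # Step 3: each debater refines its CIPHER response; no decoding.
        responses = [cipher_generate(m, context, vocab_embs, eos_id,
                                     temperature, max_len)
                     for m in models]
    # Step 4: nearest-neighbour search maps each embedding back to a token.
    texts = [decode_tokens(torch.cdist(r, vocab_embs).argmin(dim=1))
             for r in responses]
    return aggregate(texts)
```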

Results

CIPHER is benchmarked against three baselines.

  1. Single Answer: a single LLM provides one response in natural language.
  2. Self-Consistency: a single LLM independently generates multiple responses (five here), then applies majority voting (sketched after this list).
  3. Natural Language Debate: each LLM provides an initial response, then uses the other debaters' responses to refine its own.
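The aggregation in the Self-Consistency baseline is plain majority voting. A minimal sketch, assuming each response has already been reduced to a canonical answer string:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Pick the most common answer among the independent samples.
    return Counter(answers).most_common(1)[0][0]
```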

The third baseline is closest to CIPHER, with the primary difference being the method of communication (sampled tokens vs the full token distribution).

Most experiments are conducted using LLaMA2-70B, one of the largest open-source models available at the time. The evaluation is across five reasoning benchmarks.

[Table: evaluation using the same LLaMA model]

Table 1 uses two debating LLMs from the same family. Table 2 has LLaMA2-70B and LLaMA-65B debate each other; unsurprisingly, the 70B model performs worse when paired with the 65B than with another 70B, and the 65B does better when paired with the 70B than with another 65B.

An ablation study is also undertaken, but not described here; see Section 5.3 of the paper for details.