Hierarchical Reasoning Model



High-Level Summary

Details of the HRM architecture can be found in my HRM summary. Familiarity with this is assumed in this summary.

HRM-Agent

Architecture

HRM-Agent uses "the simplest possible model architecture to adapt the HRM [into] an RL agent".

The loss is the MSE between the current model's predicted Q-values, for the current environment step, and bootstrapped target values:

$$\mathcal{L} := \mathrm{MSE}\bigl( Q_\theta(s, a),\, y \bigr) \quad\text{where}\quad y := r + \gamma (1 - d) \max_{a'} Q_{\theta^-}(s, a')$$

here, $\theta^-$ denotes a copy of the current model's parameters and $d$ is a terminal-step indicator, precluding rewards beyond the end of an episode.
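As a minimal sketch of this loss, the bootstrapped target and the MSE can be computed as follows (numpy, batched; the function and variable names are my own, not from the paper):

```python
import numpy as np

def td_targets(rewards, dones, next_q, gamma=0.99):
    # y = r + gamma * (1 - d) * max_a' Q_target(s', a')
    # next_q: (batch, num_actions) Q-values from the frozen copy theta^-
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def q_loss(q_pred, actions, targets):
    # MSE between Q_theta(s, a) for the actions actually taken
    # and the bootstrapped targets y
    chosen = q_pred[np.arange(len(actions)), actions]
    return np.mean((chosen - targets) ** 2)
```

Note the `(1 - d)` factor: on terminal steps the target collapses to the immediate reward, so no value is bootstrapped from beyond the episode's end.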

The Adaptive Computation Time (ACT) component, which allows early exiting during training, is disabled. This facilitates convergence analysis of the recurrent state $z$, with its 'low' ($z_L$) and 'high' ($z_H$) levels of hierarchy.

Reusing Calculations

The model now has two time-like dimensions: recurrent reasoning steps within each forward pass, and environment timesteps across the episode.

HRM randomly initialises the latent states $z_L$ and $z_H$ at the start of every forward pass (which consists of ≤16 recursions). In many environments, particularly ones that are not fully observable from the start, the contents of $z$ from the previous environment timestep are still relevant for the current one. In fact, consistency is often important for plan execution.

Two HRM-Agent variants are evaluated: one that carries $z$ over from the previous environment timestep ("Carry $Z$") and one that re-initialises it at every timestep ("Reset $Z$").
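The difference between the two variants can be sketched as follows (my own illustrative names; the actual HRM-Agent latents are learned tensors, not raw Gaussian vectors):

```python
import numpy as np

def init_z(dim, rng):
    # HRM randomly initialises both levels of the latent hierarchy
    return {"z_L": rng.standard_normal(dim), "z_H": rng.standard_normal(dim)}

def next_latents(prev_z, carry_z, dim, rng):
    # "Carry Z": reuse the converged latents from the previous env step,
    # preserving any plan encoded there.
    # "Reset Z": draw fresh random latents, as in the original HRM.
    if carry_z and prev_z is not None:
        return prev_z
    return init_z(dim, rng)
```

The carry variant makes the recurrent state a persistent memory across environment timesteps, which is what allows a partially-built plan to survive from one action to the next.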

Experiments and Results

The HRM paper focussed on ARC-AGI, Sudoku and maze path-planning. The authors adapt the latter into a dynamic, partially-observed problem by placing and removing obstacles during an episode. The agent is given the full state at the start of each episode, but the state changes throughout it - the agent has to discover these changes for itself.


The experiments were designed with the following purposes:

Validation of Concept

Figure 3 displays the proportion of episodes in which the agent reaches the goal. Each training run is displayed as a separate series.

Success in the random maze environment demonstrates a generalised ability to plan paths. The large number of possible mazes, along with the small parameter count, reduces the chance that the agent has simply memorised the optimal path for each.

Analysis of Recurrence: "Carry $Z$" vs "Reset $Z$"

The authors (claimed to have) hypothesised that the recurrent state $z$ contains part (or all) of the planned path to the goal, even though the agent only moves one square at a time. This is because the best next action depends on knowledge of a full viable path.

Four conditions were analysed (two choices for two options): carry/reset $z$; environment has/hasn't changed. For each, the MSE between $z_{L/H}^i$ (at recurrent timestep $i$) and $z_{L/H}^{NT}$ (the final one, using the HRM notation) is plotted.
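The convergence measure can be sketched as follows: given a trace of the latent state after each recurrent step of a forward pass, compute the MSE of each intermediate state against the final one (function name is my own):

```python
import numpy as np

def convergence_curve(z_trace):
    # z_trace: (NT, dim) - latent state after each of the NT recurrent steps
    # Returns MSE of each intermediate state against the final state z^{NT};
    # the last entry is 0 by construction.
    final = z_trace[-1]
    return np.array([np.mean((z_i - final) ** 2) for z_i in z_trace])
```

A curve that falls quickly towards zero indicates that the latent state settles early in the forward pass.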

The authors ask us to consider the following points before viewing the plots.

The authors describe their results, shown below, as "broadly agree[ing]" with these statements.

The divergence between the first and $i$-th converged recurrent states - ie, the latent output of the first and $i$-th forward passes - in the random maze environment. Here, the final latent state is indeed significantly closer to the initial one in the "Carry $Z$" version. The authors suggest that resultant paths are more consistent in this case.
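The companion measure to the within-pass convergence curve, sketched under the same assumptions (my own function name): here the trace runs over environment timesteps, one converged latent per forward pass, and each is compared against the first.

```python
import numpy as np

def divergence_curve(z_converged):
    # z_converged: (num_env_steps, dim) - the converged latent output
    # of each forward pass across an episode.
    # Returns MSE of each pass's latent against the first pass's latent;
    # low values mean the plan representation stays consistent over time.
    first = z_converged[0]
    return np.array([np.mean((z - first) ** 2) for z in z_converged])
```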

These rather inconsistent results make any conclusions difficult. This is ML, though, so that doesn't stop the authors: "these support the belief that the agent is constructing a representation of the path to the goal in its latent state (ie, reasoning), before generating the next action from this plan".

Analysis of convergence of latent state $z$ to its final values in the 4-rooms environment

Analysis of convergence of latent state $z$ to its final values in the random maze environment

Divergence of the recurrent state $z$ from its first converged value

Conclusion