Details of the HRM architecture can be found in my HRM summary. Familiarity with this is assumed in this summary.
HRM-Agent uses "the simplest possible model architecture to adapt the HRM [into] an RL agent".
MSE loss between the current model's predicted $Q$-values, for the current environment step, and bootstrapped target values:

$$\mathcal{L} = \left( Q_\theta(s_t, a_t) - \left[ r_t + \gamma \, (1 - d_t) \, \max_{a'} Q_{\bar{\theta}}(s_{t+1}, a') \right] \right)^2$$

here, $Q_{\bar{\theta}}$ is a copy of the current model and $d_t$ is a terminal-step indicator, precluding rewards beyond the end of an episode.
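A minimal sketch of this bootstrapped loss (the function and argument names are my own, not from the paper; this is the standard DQN-style target, assuming the Q-values have already been computed):

```python
import numpy as np

def td_loss(q_pred, reward, q_next_max, done, gamma=0.99):
    """MSE between predicted Q-values and bootstrapped targets.

    q_pred:     Q(s_t, a_t) from the current model
    q_next_max: max over a' of Q(s_{t+1}, a') from the frozen copy
    done:       1.0 at terminal steps, so no reward leaks past episode end
    """
    target = reward + gamma * (1.0 - done) * q_next_max
    return np.mean((q_pred - target) ** 2)
```

Note that the target is treated as a constant: gradients flow only through `q_pred`, not through the frozen copy.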
The Adaptive Computation Time (ACT) component, which allows early exiting during training, is disabled. This facilitates convergence analysis of the recurrent state $z$, with its 'low' ($z_L$) and 'high' ($z_H$) levels of hierarchy.
The model now has two time-like dimensions: recurrent steps within a forward pass, and environment timesteps across an episode.
HRM randomly initialises the latent states $z_L$ and $z_H$ at the start of every forward pass (which consists of ≤16 recursions). In many environments, particularly ones that are not fully observable from the start, the contents of $z$ from the previous environment timestep are still relevant to the current one. In fact, consistency is often important for plan execution.
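To make the carry-vs-reset distinction concrete, here is a toy sketch of the control flow (the `forward_pass` dynamics are a stand-in contraction, not the real HRM update; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_pass(z_L, z_H, obs):
    # Stand-in for HRM's <=16 recursions: the low-level state chases the
    # observation; the high-level state integrates the low-level one.
    for _ in range(16):
        z_L = 0.5 * z_L + 0.5 * obs
        z_H = 0.9 * z_H + 0.1 * z_L
    return z_L, z_H

def run_episode(carry_z, n_steps=5, dim=4):
    z_L, z_H = rng.normal(size=dim), rng.normal(size=dim)
    for t in range(n_steps):
        obs = np.ones(dim) * t  # placeholder observation
        if not carry_z and t > 0:
            # HRM default: re-initialise the latents every forward pass
            z_L, z_H = rng.normal(size=dim), rng.normal(size=dim)
        z_L, z_H = forward_pass(z_L, z_H, obs)
    return z_L, z_H
```

The only difference between the two variants is whether the latents survive the environment-step boundary.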
Two HRM-Agent variants are evaluated: one that carries $z$ across environment timesteps, and one that resets it each step as in the original HRM.
The HRM paper focussed on ARC-AGI, Sudoku and maze path-planning. The authors adapt the latter to a dynamic, partially-observed problem by placing and removing obstacles. The agent is given the full state at the start of each episode, but the state changes throughout the episode - the agent has to discover these changes for itself.
Four-rooms. Precisely one door is closed at all times. Each environment step, the closed door is randomly reassigned with a fixed probability. This is infrequent enough that it is often worth finding an alternative path when reaching a closed door.
Random maze. The maze begins with a regular grid of permanent walls, along with 10 fixed walls and 5 dynamic walls, whose locations are randomly assigned. Each dynamic wall independently opens/closes with a fixed probability at each environment timestep.
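A sketch of the four-rooms door dynamics described above (function and parameter names are my own; the reassignment probability itself is not reproduced here):

```python
import random

def step_doors(closed_door, doors, p_reassign, rng):
    """One environment step of door dynamics for the four-rooms variant.

    With probability p_reassign, the single closed door is reassigned
    uniformly at random among all doors; otherwise it stays put.
    """
    if rng.random() < p_reassign:
        closed_door = rng.choice(doors)
    return closed_door
```

Because exactly one door is closed at a time, a blocked path always leaves an alternative route open.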
The experiments were designed with the following purposes:
Figure 3 displays the proportion of episodes in which the agent reaches the goal. Each training run is displayed as a separate series.
Success in the random maze environment demonstrates a generalised ability to plan paths. The large number of possible mazes, along with the agent's small parameter count, reduces the chance that the agent has simply memorised the optimal path for each.
The authors (claimed to have) hypothesised that the recurrent state contains part (or all) of the planned path to the goal, even though the agent only moves one square at a time. This is because the best next action depends on knowledge of a full viable path.
Four conditions were analysed (two choices for two options): carry/reset $z$; environment has/hasn't changed. For each, the MSE between $z^i$ (the recurrent state at recurrent timestep $i$) and $z^T$ (the final one, using the HRM notation) is plotted.
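The convergence metric can be computed as follows (a sketch; `z_trace` and its stacking convention are my assumptions, not the authors' code):

```python
import numpy as np

def convergence_curve(z_trace):
    """MSE between z at each recurrent timestep i and the final z^T.

    z_trace: array of shape (T, dim) - one row per recurrent timestep
             of a single forward pass.
    Returns an array of length T; a curve decaying to 0 indicates the
    recurrent state has converged by the final timestep.
    """
    z_final = z_trace[-1]
    return np.mean((z_trace - z_final) ** 2, axis=1)
```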
The authors ask us to consider the following points before viewing the plots.
The authors describe their results, shown below, as "broadly agree[ing]" with these statements.
The divergence between the first and $n$-th converged recurrent states - ie, the latent output of the first and $n$-th forward passes - in the random maze environment. Here, the final latent state is indeed significantly closer to the initial one in the "Carry $z$" version. The authors say this suggests that resultant paths are more consistent in this case.
These rather inconsistent results make any conclusions difficult. This is ML, though, so that doesn't stop the authors: "these support the belief that the agent is constructing a representation of the path to the goal in its latent state (ie, reasoning), before generating the next action from this plan".


