This overview assumes familiarity with the HRM architecture, which is covered in the HRM summary, particularly the Framework section.
The official ARC Prize team have analysed HRM:
First, the results were approximately reproduced on the semi-private sets.
A series of ablation studies call into question the narrative around HRM.
Findings 2 & 3 suggest the approach is fundamentally similar to ARC-AGI without Pretraining (Liao & Gu, '24).
The four main components of the HRM paper are investigated: the model architecture, the high–low hierarchical computation, the outer refinement loop and the use of data augmentation.
Specifically, the following were tested, taken verbatim from the relevant section.
Two experiments were performed.

On ARC-AGI-1, a regular transformer comes within ~5pp of the HRM model, without hyperparameter optimisation. Varying the number of H- and L-level steps relative to the baseline (L=2 and H=2) only decreased performance.
These results suggest that, whilst the HRM architecture provides a small benefit, it is not the main driver of HRM's performance on ARC-AGI.
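To make the two-timescale hierarchy concrete, here is a minimal sketch. The module functions `f_l` and `f_h` are hypothetical stand-ins for HRM's recurrent modules, not its actual code; `H=2, L=2` matches the baseline setting above.

```python
def hrm_step(z_h, z_l, x, f_l, f_h, H=2, L=2):
    """One hierarchical forward pass: the low-level module f_l takes L
    fast steps per high-level update, and the high-level module f_h
    updates once per cycle, for H cycles."""
    for _ in range(H):
        for _ in range(L):
            z_l = f_l(z_l, z_h, x)  # fast, low-level updates
        z_h = f_h(z_h, z_l)         # slow, high-level update
    return z_h, z_l
```

The ablation above corresponds to varying `H` and `L` away from 2.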
The model feeds its high-level output back into itself, allowing iterative refinement. It uses "adaptive computational time" (ACT) to control the number of iterations. To analyse this, the maximum number of outer loops during training was varied (forcing the maximum number during inference, as in the HRM implementation). Below, the label "ACT N Loops" means a maximum of N loops, with ACT early stopping.

Clearly, this parameter has a large impact, roughly doubling performance from ~20% to ~40%. Interestingly, using ACT with a maximum of 16 loops is a slight improvement over a fixed 16.
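The outer refinement loop with ACT early stopping can be sketched as below. `model` and `halt_head` are hypothetical stand-ins (HRM's real halting mechanism is learned); during training HRM forces the maximum number of loops, so the early stop matters at inference.

```python
def refine_with_act(model, halt_head, x, max_loops=16, threshold=0.5):
    """Outer refinement loop: the model's state is fed back in as its
    own input; a halting head decides when to stop early, up to
    max_loops iterations."""
    state = model.init_state(x)
    for step in range(max_loops):
        state = model(state, x)           # one refinement pass
        if halt_head(state) > threshold:  # ACT-style early stop
            break
    return state, step + 1
```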
To understand refinement during training vs inference, the number of loops was varied during inference too.

Training with more refinement makes a big difference when no inference-time refinement is allowed: performance more than doubles, a gain of over 15pp. Beyond 4 refinement loops at inference, there is little impact.
The original HRM is trained on augmented versions of the demonstration (example) pairs from both the training and evaluation sets; ConceptARC is also used.
NB. This does not imply data leakage: the model never sees, at training time, the test (problem) pairs. Fundamentally, this is a zero-pretraining test-time training approach. To emphasise, it uses the example pairs, as well as augmented versions, from the entire evaluation set to train.
This certainly raises questions about its generalisability: it relies heavily on data augmentation, including for other tasks such as Sudoku.
This is similar to ARC-AGI without Pretraining (Liao & Gu, '24), and amounts to using the model as a kind of program synthesis substrate: gradient descent on example pairs encodes in the weights a program that performs the task.
To understand cross-task transfer learning, the 400 training tasks and 160 ConceptARC tasks were removed. This dropped performance from 41% to 31%. The ARC Prize team suggests the performance is driven by test-time training. I'm not so sure.
The ARC Prize team suggest a stronger version: train on just one evaluation task (with augmentations). This is Liao & Gu's set-up. They speculate the results would be similar (21% pass@2).
Another interesting test would be to train solely on the training set, and evaluate on the eval set. Augmentations can still be used: create n augmented versions → make predictions → undo augmentation → majority selection. The weights wouldn't be updated during these augmentations, though.
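The augment → predict → invert → majority-vote pipeline can be sketched as follows. Each augmentation is given as an (apply, invert) pair of functions and `predict` stands in for the trained model; the names are hypothetical, not HRM's API.

```python
from collections import Counter

def augmented_vote(predict, task_input, augmentations, k=2):
    """Run the model on each augmented version of the input, undo the
    augmentation on each prediction, then keep the k most common
    candidates (majority voting, e.g. k=2 for pass@2)."""
    votes = Counter()
    for apply_aug, invert_aug in augmentations:
        pred = predict(apply_aug(task_input))
        votes[invert_aug(pred)] += 1
    return [cand for cand, _ in votes.most_common(k)]
```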
HRM makes predictions for all augmented versions of a task, then reverts the augmentation to place in the original format. Majority voting selects the final candidate(s). Two modifications are tested.
The latter is restricted to reducing the number of augmentations, since HRM can't process augmentation types not encountered during training.

Two trends are exhibited.
First and foremost, the dataset: HRM "flattens" all input–output pairs. Each pair gets an id, which consists of the task hash and a code for the augmentation applied.
At training and inference time, the model only receives the input and id—there is no few-shot context with other examples of the task. The model has to learn to relate an id to a specific transformation.
To that end, the model feeds the id into a large embedding layer; without this, it wouldn't know what to do with an input. This presents a major limitation: the model can only be applied to puzzles whose ids it has seen in training.
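A minimal sketch of the id scheme and its consequence. The exact hash format and table layout are guesses for illustration, not HRM's actual implementation.

```python
import hashlib

import numpy as np

def puzzle_id(task_json: str, augmentation_code: int) -> str:
    """An id built from the task hash plus a code for the augmentation
    applied (the encoding here is hypothetical)."""
    task_hash = hashlib.sha256(task_json.encode()).hexdigest()[:8]
    return f"{task_hash}-{augmentation_code}"

def embed(pid, table, index):
    """The id indexes a row of a large embedding table built at training
    time; an id absent from the table has no trained embedding, so the
    model cannot handle puzzles it never saw during training."""
    if pid not in index:
        raise KeyError(f"unseen puzzle id {pid!r}: no trained embedding")
    return table[index[pid]]
```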
In conversation with the ARC Prize team, the HRM authors said that replacing the id embeddings with few-shot contexts would be a complex engineering change. As such, inference data has to be part of the training dataset.
Lastly, whilst the refinement loop clearly has a significant impact on performance, HRM is purely transductive: at no point is the underlying program made explicit. The ARC Prize team hypothesise this won't generalise.