Hierarchical Reasoning Model

Claimed capabilities and excitement

  • HRM is reported to solve hard combinatorial tasks (extreme Sudoku, 30×30 mazes) with near-perfect accuracy and to score ~40% on ARC-AGI-1, using a 27M-parameter model trained “from scratch” on ~1,000 examples.
  • Commenters find the results “incredible” if correct, especially given the small model size and dataset, and appreciate that the authors released working code and checkpoints.
  • The architecture’s high-level / low-level recurrent split and adaptive halting (“thinking fast and slow”) are seen as conceptually elegant and reminiscent of human cognition.

Architecture, hierarchy, and symbolic flavor

  • HRM uses two interdependent recurrent modules: a slow, abstract high-level planner and a fast, detailed low-level module; the low-level module iterates to a local equilibrium, then the high-level module updates its context and restarts the low-level loop (a minimal sketch follows this list).
  • This looped, hierarchical structure is compared to symbolic or “video game” AI, biological brains, and modular theories of cognition (e.g., fuzzy-trace theory, specialized brain regions).
  • Some see this as a promising direction: modular, compositional systems with many cooperating specialized submodules, potentially combined with MoE or LLMs.
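
A minimal sketch of that two-timescale loop, assuming GRU cells in place of the paper’s transformer blocks and a simple thresholded halting head in place of its learned halting mechanism (module names, loop counts, and the 0.5 threshold are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative two-timescale recurrent loop in the spirit of HRM."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int,
                 inner_steps: int = 4, max_outer_steps: int = 8):
        super().__init__()
        self.low = nn.GRUCell(d_in + d_hidden, d_hidden)   # fast, detailed module
        self.high = nn.GRUCell(d_hidden, d_hidden)         # slow, abstract planner
        self.halt_head = nn.Linear(d_hidden, 1)            # adaptive halting signal
        self.readout = nn.Linear(d_hidden, d_out)
        self.d_hidden = d_hidden
        self.inner_steps = inner_steps
        self.max_outer_steps = max_outer_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_low = x.new_zeros(x.shape[0], self.d_hidden)
        z_high = x.new_zeros(x.shape[0], self.d_hidden)
        for _ in range(self.max_outer_steps):
            # Low-level module iterates toward a local equilibrium
            # under the high-level module's current context.
            for _ in range(self.inner_steps):
                z_low = self.low(torch.cat([x, z_high], dim=-1), z_low)
            # High-level module updates its plan from the settled
            # low-level state; the low-level loop then restarts.
            z_high = self.high(z_low, z_high)
            # Adaptive halting: stop early once the planner signals done
            # (a hypothetical stand-in for the paper's halting scheme).
            if torch.sigmoid(self.halt_head(z_high)).mean() > 0.5:
                break
        return self.readout(z_high)

model = HRMSketch(d_in=16, d_hidden=32, d_out=10)
print(model(torch.randn(2, 16)).shape)  # torch.Size([2, 10])
```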

Skepticism about results and methodology

  • Many are highly skeptical that a 27M-parameter model can be trained from scratch on 1,000 datapoints without overfitting, especially given the lack of comparisons to same-sized transformers trained on the same data.
  • It’s noted that the “1,000 examples” claim hides heavy data augmentation (e.g., color relabeling and rotations, up to ~1,000× per example), so the effective dataset is far larger (see the augmentation sketch after this list).
  • Concerns are raised about ARC-AGI usage: possible misuse of evaluation examples in training and discrepancies with public leaderboards.
  • Some argue the paper over-markets its relevance to “general-purpose reasoning,” likening this to claiming that a chess engine’s dominance at chess proves superiority over LLMs.
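
To illustrate why the effective dataset dwarfs the nominal 1,000 examples, here is a hedged sketch of the kind of color-relabeling and rotation/flip augmentation commenters describe, applied to an ARC-style grid; the released code’s exact augmentation pipeline may differ:

```python
import numpy as np

def augment_arc_grid(grid: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One random augmentation of an ARC-style color grid (illustrative)."""
    # Relabel the 10 ARC colors with a random permutation.
    perm = rng.permutation(10)
    grid = perm[grid]
    # Apply a random rotation, optionally followed by a flip.
    grid = np.rot90(grid, k=int(rng.integers(4)))
    if rng.integers(2):
        grid = np.fliplr(grid)
    return grid

rng = np.random.default_rng(0)
base = rng.integers(0, 10, size=(5, 5))
# ~10! color permutations x 8 rotation/flip variants per base grid, so
# even 1,000 "examples" expand into a vastly larger effective dataset.
augmented = [augment_arc_grid(base, rng) for _ in range(1000)]
print(len(augmented), augmented[0].shape)
```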

Scope, generalization, and scaling

  • Several commenters emphasize that HRM appears purpose-built for constraint-satisfaction problems with a small, fixed rule set (Sudoku, mazes, ARC tasks).
  • Doubts are expressed about scaling this architecture to language or broad QA: language involves far more rules, and running HRM-style loops at LLM scale would be very slow (see the rough cost sketch after this list).
  • Others speculate about hybrids: many small HRMs for distinct subtasks, or an LLM with an HRM-like module for constraint-heavy subproblems.
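
A back-of-envelope illustration of the speed concern (all numbers below are assumptions for illustration, not measurements from the paper): re-running a large network inner × outer times per answer multiplies the cost of a single forward pass accordingly.

```python
# Hypothetical numbers for illustration only.
params = 70e9                    # assumed LLM parameter count
flops_per_pass = 2 * params      # ~2 FLOPs per parameter per forward pass
outer_steps, inner_steps = 8, 4  # assumed HRM-style loop depths

total = flops_per_pass * outer_steps * inner_steps
print(f"{outer_steps * inner_steps}x the cost of a single forward pass "
      f"(~{total:.1e} FLOPs per answer)")
```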

Reproducibility and peer review debate

  • Code availability is widely praised, but practical replication is non-trivial: dependency/version issues, multi-GPU assumptions, long training times.
  • Some argue that in modern ML, “real peer review” is open code + independent reproduction, while traditional conference peer review is described as a light “vibe check.”
  • Others counter that trusted institutions and formal review still matter to avoid pure echo chambers; there is disagreement on how much weight “peer reviewed” should carry here.