Hierarchical Reasoning Model

Claimed capabilities and excitement

  • HRM is reported to solve hard combinatorial tasks (extreme Sudoku, 30×30 mazes) with near-perfect accuracy and to score ~40% on ARC-AGI-1, using a 27M-parameter model trained “from scratch” on ~1,000 examples.
  • Commenters find the results “incredible” if correct, especially given the small model size and dataset, and appreciate that the authors released working code and checkpoints.
  • The architecture’s high-level / low-level recurrent split and adaptive halting (“thinking fast and slow”) are seen as conceptually elegant and reminiscent of human cognition.

Architecture, hierarchy, and symbolic flavor

  • HRM uses two interdependent recurrent modules: a slow, abstract high-level planner and a fast, detailed low-level module; the low-level module iterates to a local equilibrium, then the high-level module updates its context and restarts the low-level loop (a minimal sketch follows this list).
  • This looped, hierarchical structure is compared to symbolic or “video game” AI, biological brains, and modular theories of cognition (e.g., fuzzy-trace theory, specialized brain regions).
  • Some see this as a promising direction: modular, compositional systems with many cooperating specialized submodules, potentially combined with MoE or LLMs.
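
A minimal sketch of that two-timescale loop, assuming GRU cells in place of the paper’s transformer blocks and a simple thresholded halting head in place of its learned halting mechanism (module names, loop counts, and the 0.5 threshold are illustrative, not taken from the released code):

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative two-timescale recurrent loop in the spirit of HRM."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int,
                 inner_steps: int = 4, max_outer_steps: int = 8):
        super().__init__()
        self.low = nn.GRUCell(d_in + d_hidden, d_hidden)   # fast, detailed module
        self.high = nn.GRUCell(d_hidden, d_hidden)         # slow, abstract planner
        self.halt_head = nn.Linear(d_hidden, 1)            # adaptive halting signal
        self.readout = nn.Linear(d_hidden, d_out)
        self.d_hidden = d_hidden
        self.inner_steps = inner_steps
        self.max_outer_steps = max_outer_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_low = x.new_zeros(x.shape[0], self.d_hidden)
        z_high = x.new_zeros(x.shape[0], self.d_hidden)
        for _ in range(self.max_outer_steps):
            # Low-level module iterates toward a local equilibrium
            # under the high-level module's current context.
            for _ in range(self.inner_steps):
                z_low = self.low(torch.cat([x, z_high], dim=-1), z_low)
            # High-level module updates its plan from the settled
            # low-level state; the low-level loop then restarts.
            z_high = self.high(z_low, z_high)
            # Adaptive halting: stop early once the planner signals done
            # (a hypothetical stand-in for the paper's halting scheme).
            if torch.sigmoid(self.halt_head(z_high)).mean() > 0.5:
                break
        return self.readout(z_high)

model = HRMSketch(d_in=16, d_hidden=32, d_out=10)
print(model(torch.randn(2, 16)).shape)  # torch.Size([2, 10])
```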

Skepticism about results and methodology

  • Many are highly skeptical that a 27M-parameter model can be trained from scratch on 1,000 datapoints without overfitting, especially given the lack of comparisons to same-sized transformers trained on the same data.
  • It’s noted that the “1,000 examples” claim hides heavy data augmentation (e.g., color relabeling and rotations, up to ~1,000× per example), so the effective dataset is far larger (see the augmentation sketch after this list).
  • Concerns are raised about ARC-AGI usage: possible misuse of evaluation examples in training and discrepancies with public leaderboards.
  • Some argue the paper over-markets its relevance to “general-purpose reasoning,” likening this to claiming that a chess engine’s dominance at chess proves superiority over LLMs.
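
To illustrate why the effective dataset dwarfs the nominal 1,000 examples, here is a hedged sketch of the kind of color-relabeling and rotation/flip augmentation commenters describe, applied to an ARC-style grid; the released code’s exact augmentation pipeline may differ:

```python
import numpy as np

def augment_arc_grid(grid: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One random augmentation of an ARC-style color grid (illustrative)."""
    # Relabel the 10 ARC colors with a random permutation.
    perm = rng.permutation(10)
    grid = perm[grid]
    # Apply a random rotation, optionally followed by a flip.
    grid = np.rot90(grid, k=int(rng.integers(4)))
    if rng.integers(2):
        grid = np.fliplr(grid)
    return grid

rng = np.random.default_rng(0)
base = rng.integers(0, 10, size=(5, 5))
# ~10! color permutations x 8 rotation/flip variants per base grid, so
# even 1,000 "examples" expand into a vastly larger effective dataset.
augmented = [augment_arc_grid(base, rng) for _ in range(1000)]
print(len(augmented), augmented[0].shape)
```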

Scope, generalization, and scaling

  • Several commenters emphasize that HRM appears purpose-built for constraint-satisfaction problems with a small, fixed rule set (Sudoku, mazes, ARC tasks).
  • Doubts are expressed about scaling this architecture to language or broad QA: language involves far more rules, and running HRM-style loops at LLM scale would be very slow (see the rough cost sketch after this list).
  • Others speculate about hybrids: many small HRMs for distinct subtasks, or an LLM with an HRM-like module for constraint-heavy subproblems.
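
A back-of-envelope illustration of the speed concern (all numbers below are assumptions for illustration, not measurements from the paper): re-running a large network inner × outer times per answer multiplies the cost of a single forward pass accordingly.

```python
# Hypothetical numbers for illustration only.
params = 70e9                    # assumed LLM parameter count
flops_per_pass = 2 * params      # ~2 FLOPs per parameter per forward pass
outer_steps, inner_steps = 8, 4  # assumed HRM-style loop depths

total = flops_per_pass * outer_steps * inner_steps
print(f"{outer_steps * inner_steps}x the cost of a single forward pass "
      f"(~{total:.1e} FLOPs per answer)")
```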

Reproducibility and peer review debate

  • Code availability is widely praised, but practical replication is non-trivial: dependency/version issues, multi-GPU assumptions, long training times.
  • Some argue that in modern ML, “real peer review” is open code + independent reproduction, while traditional conference peer review is described as a light “vibe check.”
  • Others counter that trusted institutions and formal review still matter to avoid pure echo chambers; there is disagreement on how much weight “peer reviewed” should carry here.