Hierarchical Reasoning Model
Claimed capabilities and excitement
- HRM is reported to solve hard combinatorial tasks (extreme Sudoku, 30×30 mazes) with near-perfect accuracy, and to score ~40% on ARC-AGI-1, using a 27M-parameter model trained “from scratch” on ~1,000 examples.
- Commenters find the results “incredible” if correct, especially given the small model size and dataset, and appreciate that the authors released working code and checkpoints.
- The architecture’s high-level / low-level recurrent split and adaptive halting (“thinking fast and slow”) are seen as conceptually elegant and reminiscent of human cognition.
Architecture, hierarchy, and symbolic flavor
- HRM uses two interdependent recurrent modules: a slow, abstract planner and a fast, detailed worker. The low-level module iterates to a local equilibrium; the high-level module then updates its context and restarts the low-level module.
- This looped, hierarchical structure is compared to symbolic or “video game” AI, biological brains, and modular cognition theories (e.g., fuzzy trace, specialized brain regions).
- Some see this as a promising direction: modular, compositional systems with many cooperating specialized submodules, potentially combined with MoE or LLMs.
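The nested high-level / low-level recurrence and adaptive halting described above can be sketched in a few lines. This is a toy scalar stand-in, not the paper's actual networks: the update rules, mixing coefficients, and halting criterion (stop when the high-level state stabilizes) are all illustrative assumptions.

```python
def hrm_loop(x, max_high=64, max_low=64, tol=1e-6):
    """Toy HRM-style nested recurrence on scalar state.

    z_h: slow, abstract "planner" state; z_l: fast, detailed state.
    The low-level module iterates to a local equilibrium given the
    current high-level context, then the high-level module updates
    once from that equilibrium. An ACT-like halting check stops the
    outer loop when the plan stops changing.
    """
    z_h, z_l = 0.0, 0.0
    for step in range(1, max_high + 1):
        # fast module: iterate to a local fixed point (toy contraction)
        for _ in range(max_low):
            z_new = 0.5 * z_l + 0.25 * z_h + 0.25 * x
            converged = abs(z_new - z_l) < tol
            z_l = z_new
            if converged:
                break
        # slow module: one context update from the low-level equilibrium
        z_h_new = 0.5 * z_h + 0.5 * z_l
        # adaptive-halting stand-in ("thinking fast and slow")
        if abs(z_h_new - z_h) < tol:
            return z_h_new, z_l, step
        z_h = z_h_new
    return z_h, z_l, max_high
```

With these toy contractions both states converge to the input `x`, and the halting check ends the outer loop well before `max_high`; the point is only the control flow (inner loop to equilibrium, outer context update, early stop), not the dynamics.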
Skepticism about results and methodology
- Many are highly skeptical that a 27M model can be trained from scratch on 1,000 datapoints without overfitting, especially given lack of comparisons to same-sized, same-data transformers.
- It’s noted that the “1,000 examples” claim hides heavy data augmentation (e.g., color relabeling and rotations, up to ~1,000×), so the effective dataset is far larger.
- Concerns are raised about ARC-AGI usage: possible misuse of evaluation examples in training and discrepancies with public leaderboards.
- Some argue the paper over-markets its relevance to “general-purpose reasoning,” analogous to saying a chess engine proves superiority over LLMs.
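To make the augmentation point concrete, here is a sketch of the kind of transformations commenters say inflate the “1,000 examples” figure for ARC-style color grids: the 8 dihedral (rotation/reflection) variants crossed with random color-palette permutations. The specific augmentations and counts the authors used may differ; the recoloring count here is chosen only to make 8 × 125 ≈ 1,000 derived grids per example.

```python
import random

def dihedral(grid):
    """Return all 8 rotations/reflections of a grid (list of lists)."""
    g = [row[:] for row in grid]
    out = []
    for _ in range(4):
        g = [list(row) for row in zip(*g[::-1])]   # rotate 90° clockwise
        out.append(g)
        out.append([row[::-1] for row in g])       # horizontal mirror
    return out

def recolor(grid, n_colors=10, rng=None):
    """Relabel cell colors via a random permutation of the palette."""
    rng = rng or random.Random(0)
    perm = list(range(n_colors))
    rng.shuffle(perm)
    return [[perm[c] for c in row] for row in grid]

def augment(grid, n_recolorings=125):
    """8 geometric variants x 125 recolorings = 1,000 derived grids."""
    rng = random.Random(42)
    return [recolor(g, rng=rng)
            for g in dihedral(grid)
            for _ in range(n_recolorings)]
```

Under this scheme each “single” training example yields on the order of 1,000 variants, which is why the effective dataset is far larger than the headline count.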
Scope, generalization, and scaling
- Several commenters emphasize HRM appears purpose-built for constraint-satisfaction problems with few rules (Sudoku, mazes, ARC tasks).
- Doubts are expressed about scaling this architecture to language or broad QA: language involves far more rules, and running HRM-style loops over LLM-scale models would be very slow.
- Others speculate about hybrids: many small HRMs for distinct subtasks, or an LLM with an HRM-like module for constraint-heavy subproblems.
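The speed concern can be made concrete with a back-of-envelope estimate. All numbers below are illustrative assumptions (a hypothetical 70B-parameter model, 8 high-level updates, 16 low-level iterations each), not measurements from the paper:

```python
# Rough cost of an HRM-style nested loop vs. one forward pass at LLM scale.
params = 70e9                   # assumed LLM parameter count
flops_per_token = 2 * params    # ~2 FLOPs per parameter per token (common estimate)

outer_steps = 8                 # assumed high-level context updates
inner_steps = 16                # assumed low-level iterations per update
loop_factor = outer_steps * inner_steps          # 128 forward-pass equivalents

loop_flops_per_token = flops_per_token * loop_factor
```

Even with these modest loop counts, each token costs ~128x a single forward pass, which is the rough intuition behind the “very slow at LLM scale” objection.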
Reproducibility and peer review debate
- Code availability is widely praised, but practical replication is non-trivial: dependency/version issues, multi-GPU assumptions, long training times.
- Some argue that in modern ML, “real peer review” is open code plus independent reproduction, while traditional conference peer review amounts to a light “vibe check.”
- Others counter that trusted institutions and formal review still matter to avoid pure echo chambers; there is disagreement on how much weight “peer reviewed” should carry here.