I got the highest score on ARC-AGI again by swapping Python for English

Evolutionary / “outer-loop” methods

  • Several commenters see the approach as similar to evolutionary systems (e.g., AlphaEvolve): text prompts define a high-level search space, and “genetic” mixing plus selection explores it.
  • This is framed as part of a broader trend: recent strong models reportedly use heavy “outer loop” search/verification beyond simple single-pass generation.
  • A key open problem: how to define good fitness functions for prompt/program evolution without hand-crafted human scoring; naive attempts stall quickly (one concrete shape such a loop can take is sketched after this list).
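
  A minimal sketch of one such outer loop, assuming an OpenAI-style chat API and exact-match scoring against a task's own train pairs (all names, models, and parameters here are illustrative, not the author's code):

      import random
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def llm(prompt: str) -> str:
          resp = client.chat.completions.create(
              model="gpt-4o-mini",  # model choice is an assumption
              messages=[{"role": "user", "content": prompt}],
          )
          return resp.choices[0].message.content

      def fitness(prompt: str, train_pairs: list[dict]) -> float:
          # Score a candidate instruction by how many verifiable train pairs
          # it solves; the task's own examples replace hand-crafted scoring.
          solved = sum(
              llm(f"{prompt}\n\nInput:\n{p['input']}").strip() == p["output"].strip()
              for p in train_pairs
          )
          return solved / len(train_pairs)

      def crossover(a: str, b: str) -> str:
          # "Genetic" mixing done in English: the LLM merges two parents.
          return llm(f"Merge these two instruction sets into one improved version:\n\n{a}\n\n---\n\n{b}")

      def evolve(seeds: list[str], train_pairs: list[dict],
                 generations: int = 5, pop_size: int = 8, keep: int = 3) -> str:
          population = list(seeds)
          for _ in range(generations):
              ranked = sorted(population, key=lambda p: fitness(p, train_pairs), reverse=True)
              parents = ranked[:keep]  # selection: keep the fittest prompts
              population = parents + [
                  crossover(*random.sample(parents, 2))
                  for _ in range(pop_size - keep)
              ]
          return max(population, key=lambda p: fitness(p, train_pairs))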

Scaffolding, self-scaffolding, and ASTs

  • Many argue LLMs are helpless on complex, multi-step tasks without rich scaffolding; models themselves are flexible but the scaffolds are brittle.
  • Proposed direction: “scaffolding synthesis,” where one agent designs task-specific scaffolding (plans, tools, state machines, ASTs) and another agent executes it, with feedback used to refine the scaffold (sketched after this list).
  • Examples include compiling natural-language instructions or legal documents into AST-like structures, and existing tools (e.g., code+plan modes) are cited as early instances.
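
  A hedged sketch of the two-agent pattern, reusing an `llm(prompt) -> str` callable like the one above; the JSON step list is a crude stand-in for a richer AST:

      import json

      def synthesize_scaffold(task: str, llm) -> list[dict]:
          # Planner agent: compile the task into an explicit, checkable plan.
          raw = llm(
              "Return only a JSON array of steps, each an object with keys "
              '"id", "instruction", and "check", that together solve:\n' + task
          )
          return json.loads(raw)  # a real system would validate/repair malformed JSON

      def run_scaffold(steps: list[dict], llm) -> tuple[bool, str]:
          # Executor agent: perform each step, verifying its check before moving on.
          state = ""
          for step in steps:
              state = llm(f"Current state:\n{state}\n\nPerform step: {step['instruction']}")
              verdict = llm(f"State:\n{state}\n\nIs this check satisfied: {step['check']}? Answer yes or no.")
              if not verdict.strip().lower().startswith("yes"):
                  return False, f"step {step['id']} failed its check: {step['check']}"
          return True, state

      def solve(task: str, llm, max_rounds: int = 3):
          feedback = ""
          for _ in range(max_rounds):
              steps = synthesize_scaffold(task + feedback, llm)
              ok, result = run_scaffold(steps, llm)
              if ok:
                  return result
              # Feed the failure back so the planner can refine the scaffold.
              feedback = f"\n\nA previous plan failed ({result}); produce a corrected plan."
          return None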

LLM weaknesses: memory, spatial reasoning, and vision

  • Empirical reports: models perform badly on Sokoban-like puzzles, nonograms, mazes, and ARC-style tasks—forgetting rules they previously derived and repeating disproven deductions.
  • Some attribute this mainly to poor long-range memory and reliance on lossy text context; others stress weak spatial/visual reasoning and current “bag-of-vision-tokens” frontends.
  • There is debate over whether vision or memory is the primary blocker; multiple commenters insist models need compact internal, non-verbal representations of rules and state (one reading of that idea is illustrated after this list).
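
  One concrete reading of that idea, sketched as an explicit state object kept outside the context window (purely illustrative):

      from dataclasses import dataclass, field

      @dataclass
      class PuzzleState:
          # Compact, queryable state: derived rules and refuted hypotheses
          # persist here instead of scrolling away as lossy text context.
          grid: list[list[int]]
          rules: dict[str, str] = field(default_factory=dict)
          refuted: set[str] = field(default_factory=set)

          def record_rule(self, name: str, rule: str) -> None:
              self.rules[name] = rule

          def refute(self, hypothesis: str) -> None:
              self.refuted.add(hypothesis)

          def is_refuted(self, hypothesis: str) -> bool:
              # Guards against the failure mode reported above:
              # re-proposing a deduction that was already disproven.
              return hypothesis in self.refuted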

ARC-AGI’s role and modality issues

  • Several see ARC-AGI as primarily a visual benchmark on which humans benefit from strong innate visual preprocessing; if the puzzles were handed to people as raw JSON, most would first transform them into graphics (see the rendering sketch after this list).
  • Others note that strong computer-vision modules exist but haven’t yet produced very high ARC-AGI scores when bolted onto LLMs.
  • Some view this work as meaningful progress on one of the few benchmarks where humans still dominate; others think it’s “slightly smarter brute force” or overfitting to a contrived task.
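
  The modality point is easy to make concrete: an ARC task arrives as nested integer lists, and humans typically want them painted before reasoning. A toy rendering sketch (color choices arbitrary):

      import json

      # One ANSI background color per ARC cell value 0-9.
      COLORS = [40, 44, 41, 42, 43, 45, 46, 47, 100, 101]

      def show(grid: list[list[int]]) -> None:
          for row in grid:
              print("".join(f"\033[{COLORS[v]}m  \033[0m" for v in row))

      task = json.loads('{"train": [{"input": [[0, 1, 2], [2, 1, 0]]}]}')  # toy stand-in
      show(task["train"][0]["input"])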

Reasoning vs pattern matching and “PhD-level” claims

  • Long subthread debates whether LLMs genuinely “reason” or just perform sophisticated pattern matching.
  • One side argues that high benchmark scores, commonsense examples, and mech-interp findings (latent world models, abstract circuits) imply reasoning functionally similar to humans’, albeit text- and 1D-biased.
  • The opposing side stresses failures on simple puzzles, out-of-domain tasks, lack of runtime learning, and reliance on offline RL as signs they are closer to expert systems trained to the test.
  • Definitions are contested: some equate reasoning with advanced pattern matching; others insist true human-like reasoning must include continual learning and generalization to genuinely novel problems.

Dead zones, RL, and learning over time

  • The article’s notion of “dead reasoning zones” is challenged; critics say humans do exhibit systematic reasoning failures, especially in abductive inference or under cognitive dissonance.
  • Questions are raised about the claim that RL “forces logical consistency”; skeptics note that repeated trial-and-error with an oracle differs from humans’ one-shot reasoning and self-checking.
  • Several point out that LLMs could, in principle, approximate runtime learning via external memory plus periodic fine-tuning on their own experience, but this is not how today’s models generally operate (a sketch of the idea follows this list).
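
  A hedged sketch of that proposal, with hypothetical file names and thresholds: log every episode to external memory, then periodically distill verified successes into a fine-tuning set:

      import json

      MEMORY = "experience.jsonl"
      MIN_EPISODES = 1000

      def record(prompt: str, answer: str, reward: float) -> None:
          # External memory: append every episode as it happens.
          with open(MEMORY, "a") as f:
              f.write(json.dumps({"prompt": prompt, "answer": answer, "reward": reward}) + "\n")

      def export_finetune_set(min_reward: float = 1.0) -> str | None:
          # Periodic consolidation: once enough experience has accumulated,
          # turn verified successes into an OpenAI-style chat JSONL file.
          with open(MEMORY) as f:
              episodes = [json.loads(line) for line in f]
          if len(episodes) < MIN_EPISODES:
              return None
          with open("finetune.jsonl", "w") as f:
              for ep in episodes:
                  if ep["reward"] >= min_reward:
                      f.write(json.dumps({"messages": [
                          {"role": "user", "content": ep["prompt"]},
                          {"role": "assistant", "content": ep["answer"]},
                      ]}) + "\n")
          return "finetune.jsonl"  # hand this to a periodic fine-tuning job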

Practical tools, reproducibility, and evaluation

  • Commenters share related frameworks (e.g., DSPy, GEPA-like approaches) and ask for reusable tools to run evolutionary prompt/program search at home against major APIs (a minimal example follows this list).
  • Links to the project’s GitHub and Kaggle notebooks are provided for replication.
  • Some worry that apparent improvements on public puzzles might just reflect training on blog posts or leaked solutions; others suggest controlled tests with pre‑ARC models and ablations of the new method.
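
  A minimal DSPy example of the kind of at-home setup being asked for (model id and field names are assumptions; the metric mirrors ARC-style exact-match scoring):

      import dspy

      dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any LiteLLM-style model id

      solver = dspy.ChainOfThought("grid_task -> solution")  # field names are illustrative

      def exact_match(gold, pred, trace=None):
          return pred.solution.strip() == gold.solution.strip()

      trainset = [
          dspy.Example(grid_task="<serialized ARC grid>", solution="<expected grid>")
              .with_inputs("grid_task"),
      ]

      # BootstrapFewShot searches over few-shot demonstrations; recent DSPy
      # releases also ship a GEPA optimizer with a similar compile() interface.
      optimized = dspy.BootstrapFewShot(metric=exact_match).compile(solver, trainset=trainset)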