I got the highest score on ARC-AGI again swapping Python for English
Evolutionary / “outer-loop” methods
- Several commenters see the approach as similar to evolutionary systems (e.g., AlphaEvolve): text prompts define a high-level search space, and “genetic” mixing plus selection explores it (a minimal sketch follows this list).
- This is framed as part of a broader trend: recent strong models reportedly use heavy “outer loop” search/verification beyond simple single-pass generation.
- A key open problem: how to define good fitness functions for prompt/program evolution without hand-crafted human scoring; naive attempts stall quickly.
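For concreteness, here is a minimal sketch of the loop these comments describe, not the article’s actual pipeline: `llm` is a hypothetical stand-in for any chat-completion API, and the fitness function simply scores candidate English rules against the task’s training pairs.
```python
import random

def llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion API; not the author's pipeline."""
    raise NotImplementedError

def fitness(rule: str, train_pairs) -> float:
    """Score an English rule by how many training pairs it reproduces exactly."""
    solved = 0
    for grid_in, grid_out in train_pairs:
        prediction = llm(f"Apply this rule:\n{rule}\n\nInput:\n{grid_in}\nOutput:")
        solved += prediction.strip() == grid_out.strip()
    return solved / len(train_pairs)

def evolve(seed_rules, train_pairs, generations=5, pop_size=8):
    """Selection keeps the best half; 'genetic' mixing asks the model to merge parents."""
    population = list(seed_rules)
    for _ in range(generations):
        ranked = sorted(population, key=lambda r: fitness(r, train_pairs), reverse=True)
        parents = ranked[: pop_size // 2]
        children = [
            llm("Merge these two candidate rules into one improved rule:\n"
                f"A: {random.choice(parents)}\nB: {random.choice(parents)}")
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda r: fitness(r, train_pairs))
```
Note how the open problem above shows up directly in code: everything hinges on `fitness`, which here assumes exact-match training pairs exist; for tasks without such an oracle, the loop has nothing to select on.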
Scaffolding, self-scaffolding, and ASTs
- Many argue LLMs are helpless on complex, multi-step tasks without rich scaffolding; models themselves are flexible but the scaffolds are brittle.
- Proposed direction: “scaffolding synthesis,” where one agent designs task-specific scaffolding (plans, tools, state machines, ASTs) and another agent executes it, with feedback used to refine the scaffold (sketched after this list).
- Examples include compiling natural-language instructions or legal documents into AST-like structures, and existing tools (e.g., code+plan modes) are cited as early instances.
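A minimal sketch of that planner/executor split, assuming a generic `llm` call and an external `check` verifier (both hypothetical, not any tool cited in the thread):
```python
def llm(prompt: str) -> str:
    """Hypothetical model call; stands in for any chat API."""
    raise NotImplementedError

def synthesize_scaffold(task: str) -> str:
    """Planner agent: emit a task-specific scaffold (numbered plan / state machine)."""
    return llm(f"Design a numbered, step-by-step plan with explicit state for:\n{task}")

def execute_scaffold(scaffold: str, task_input: str) -> str:
    """Executor agent: follow the scaffold literally on one concrete input."""
    return llm(f"Follow this plan exactly:\n{scaffold}\n\nInput:\n{task_input}")

def refine(task: str, task_input: str, check, rounds: int = 3):
    """Feedback loop: failures revise the scaffold, not the executor."""
    scaffold = synthesize_scaffold(task)
    result = None
    for _ in range(rounds):
        result = execute_scaffold(scaffold, task_input)
        ok, report = check(result)  # external verifier or test harness
        if ok:
            break
        scaffold = llm(f"Revise this plan given the failure report:\n{scaffold}\n\n{report}")
    return scaffold, result
```
The design choice the commenters emphasize is that the brittle artifact (the scaffold) is the thing being iterated on, while the flexible model stays fixed.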
LLM weaknesses: memory, spatial reasoning, and vision
- Empirical reports: models perform badly on Sokoban-like puzzles, nonograms, mazes, and ARC-style tasks—forgetting rules they previously derived and repeating disproven deductions.
- Some attribute this mainly to poor long-range memory and reliance on lossy text context; others stress weak spatial/visual reasoning and current “bag-of-vision-tokens” frontends.
- There is debate over whether vision or memory is the primary blocker; multiple comments insist models need compact, internal, non-verbal representations of rules and state (one harness-side workaround is sketched below).
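One workaround several of these comments point toward is harness-owned state: keep the rules and the board in code and re-serialize the state fresh each turn, so nothing has to survive in the model’s lossy context. A toy illustration for a one-box Sokoban row (everything here is illustrative):
```python
# The harness, not the model, owns the rules and the board.
WALLS = {(0, c) for c in range(5)} | {(2, c) for c in range(5)} | {(1, 0), (1, 4)}
state = {"player": (1, 1), "box": (1, 2), "goal": (1, 3)}
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def step(state, move):
    """Apply one move; illegal moves are rejected in code, never 'forgotten'."""
    dr, dc = MOVES[move]
    pr, pc = state["player"]
    target = (pr + dr, pc + dc)
    if target in WALLS:
        return state, False
    if target == state["box"]:
        pushed = (target[0] + dr, target[1] + dc)
        if pushed in WALLS:
            return state, False
        state = {**state, "box": pushed}
    return {**state, "player": target}, True

new_state, ok = step(state, "R")  # pushes the box onto the goal
print(ok, new_state["box"] == new_state["goal"])  # True True
```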
ARC-AGI’s role and modality issues
- Several see ARC-AGI as primarily a visual benchmark where humans have strong innate preprocessing; if the puzzles were given as JSON, most people would first transform them into graphics (a toy renderer follows this list).
- Others note that strong computer-vision modules exist but haven’t yet produced very high ARC-AGI scores when bolted onto LLMs.
- Some view this work as meaningful progress on one of the few benchmarks where humans still dominate; others think it’s “slightly smarter brute force” or overfitting to a contrived task.
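To make the modality point concrete, here is a tiny stdlib-only renderer that turns an ARC-style JSON grid of color indices into a glanceable picture; the glyph choices are arbitrary, not anything from the article.
```python
import json

PALETTE = "  ..##@@++**%%==--::"  # two characters per color index 0-9

def render(grid):
    """Map each color index to a wide glyph so the grid reads visually."""
    return "\n".join(
        "".join(PALETTE[2 * cell: 2 * cell + 2] for cell in row) for row in grid
    )

task = json.loads('{"input": [[0, 0, 1], [0, 1, 0], [1, 0, 0]]}')
print(render(task["input"]))
# prints a diagonal of '..' glyphs:
#     ..
#   ..
# ..
```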
Reasoning vs pattern matching and “PhD-level” claims
- Long subthread debates whether LLMs genuinely “reason” or just perform sophisticated pattern matching.
- One side argues: high benchmark scores, commonsense examples, and mech-interp findings (latent world models, abstract circuits) imply reasoning functionally similar to humans’, albeit biased toward text and one-dimensional structure.
- The opposing side stresses failures on simple puzzles, out-of-domain tasks, lack of runtime learning, and reliance on offline RL as signs they are closer to expert systems trained to the test.
- Definitions are contested: some equate reasoning with advanced pattern matching; others insist true human-like reasoning must include continual learning and generalization to genuinely novel problems.
Dead zones, RL, and learning over time
- The article’s notion of “dead reasoning zones” is challenged; critics say humans do exhibit systematic reasoning failures, especially in abductive inference or under cognitive dissonance.
- Questions are raised about the claim that RL “forces logical consistency”; skeptics note that repeated trial-and-error with an oracle differs from humans’ one-shot reasoning and self-checking.
- Several point out that LLMs could, in principle, approximate runtime learning via external memory plus periodic fine-tuning on their own experience, though this is not how today’s models generally operate (a rough sketch follows).
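A rough sketch of what that “in principle” setup could look like: an append-only experience log, crude retrieval at inference time, and a periodic export to fine-tuning pairs. Every name here is hypothetical; no model today ships with this loop.
```python
import json
from pathlib import Path

BUFFER = Path("experience.jsonl")  # hypothetical append-only episode log

def remember(task: str, solution: str, reward: float) -> None:
    """Log one episode with its outcome."""
    with BUFFER.open("a") as f:
        f.write(json.dumps({"task": task, "solution": solution, "reward": reward}) + "\n")

def recall(query: str, k: int = 3):
    """Crude lexical retrieval; a real system would use embeddings."""
    if not BUFFER.exists():
        return []
    episodes = [json.loads(line) for line in BUFFER.open()]
    overlap = lambda e: len(set(query.split()) & set(e["task"].split()))
    return sorted(episodes, key=overlap, reverse=True)[:k]

def export_finetune_set(min_reward: float = 1.0):
    """Periodically turn high-reward episodes into supervised fine-tuning pairs."""
    return [{"prompt": e["task"], "completion": e["solution"]}
            for e in (json.loads(line) for line in BUFFER.open())
            if e["reward"] >= min_reward]
```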
Practical tools, reproducibility, and evaluation
- Commenters share related frameworks (e.g., DSPy, GEPA-like approaches) and ask for reusable tools for running evolutionary prompt/program search at home against the major APIs.
- Links to the project’s GitHub and Kaggle notebooks are provided for replication.
- Some worry that apparent improvements on public puzzles might just reflect training on blog posts or leaked solutions; others suggest controlled tests with pre-ARC models and ablations of the new method (a minimal harness is sketched below).
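One shape such a controlled test could take: score the same held-out puzzles with and without the new outer loop, ideally using a model whose training cutoff predates the puzzles. The `solve_*` functions are placeholders, not anything from the linked repositories.
```python
def solve_baseline(puzzle: dict) -> str:
    raise NotImplementedError  # single-pass model call (placeholder)

def solve_evolved(puzzle: dict) -> str:
    raise NotImplementedError  # same model inside the evolutionary outer loop (placeholder)

def ablate(puzzles, check):
    """check(puzzle, answer) -> bool; returns per-puzzle rows plus the net gain."""
    rows = [{"id": p["id"],
             "baseline": check(p, solve_baseline(p)),
             "evolved": check(p, solve_evolved(p))}
            for p in puzzles]
    gain = sum(r["evolved"] for r in rows) - sum(r["baseline"] for r in rows)
    return rows, gain  # gain isolates the method's contribution from the model's
```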