I got the highest score on ARC-AGI again by swapping Python for English

Evolutionary / “outer-loop” methods

  • Several commenters see the approach as similar to evolutionary systems (e.g., AlphaEvolve): text prompts define a high-level search space, and “genetic” mixing plus selection explores it.
  • This is framed as part of a broader trend: recent strong models reportedly use heavy “outer loop” search/verification beyond simple single-pass generation.
  • A key open problem: how to define good fitness functions for prompt/program evolution without hand-crafted human scoring; naive attempts stall quickly (one concrete shape such a loop can take is sketched after this list).
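
  A minimal sketch of one such outer loop, assuming an OpenAI-style chat API and exact-match scoring against a task's own train pairs (all names, models, and parameters here are illustrative, not the author's code):

      import random
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def llm(prompt: str) -> str:
          resp = client.chat.completions.create(
              model="gpt-4o-mini",  # model choice is an assumption
              messages=[{"role": "user", "content": prompt}],
          )
          return resp.choices[0].message.content

      def fitness(prompt: str, train_pairs: list[dict]) -> float:
          # Score a candidate instruction by how many verifiable train pairs
          # it solves; the task's own examples replace hand-crafted scoring.
          solved = sum(
              llm(f"{prompt}\n\nInput:\n{p['input']}").strip() == p["output"].strip()
              for p in train_pairs
          )
          return solved / len(train_pairs)

      def crossover(a: str, b: str) -> str:
          # "Genetic" mixing done in English: the LLM merges two parents.
          return llm(f"Merge these two instruction sets into one improved version:\n\n{a}\n\n---\n\n{b}")

      def evolve(seeds: list[str], train_pairs: list[dict],
                 generations: int = 5, pop_size: int = 8, keep: int = 3) -> str:
          population = list(seeds)
          for _ in range(generations):
              ranked = sorted(population, key=lambda p: fitness(p, train_pairs), reverse=True)
              parents = ranked[:keep]  # selection: keep the fittest prompts
              population = parents + [
                  crossover(*random.sample(parents, 2))
                  for _ in range(pop_size - keep)
              ]
          return max(population, key=lambda p: fitness(p, train_pairs))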

Scaffolding, self-scaffolding, and ASTs

  • Many argue LLMs are helpless on complex, multi-step tasks without rich scaffolding; models themselves are flexible but the scaffolds are brittle.
  • Proposed direction: “scaffolding synthesis,” where one agent designs task-specific scaffolding (plans, tools, state machines, ASTs) and another agent executes it, with feedback used to refine the scaffold (sketched after this list).
  • Examples include compiling natural-language instructions or legal documents into AST-like structures, and existing tools (e.g., code+plan modes) are cited as early instances.
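
  A hedged sketch of the two-agent pattern, reusing an `llm(prompt) -> str` callable like the one above; the JSON step list is a crude stand-in for a richer AST:

      import json

      def synthesize_scaffold(task: str, llm) -> list[dict]:
          # Planner agent: compile the task into an explicit, checkable plan.
          raw = llm(
              "Return only a JSON array of steps, each an object with keys "
              '"id", "instruction", and "check", that together solve:\n' + task
          )
          return json.loads(raw)  # a real system would validate/repair malformed JSON

      def run_scaffold(steps: list[dict], llm) -> tuple[bool, str]:
          # Executor agent: perform each step, verifying its check before moving on.
          state = ""
          for step in steps:
              state = llm(f"Current state:\n{state}\n\nPerform step: {step['instruction']}")
              verdict = llm(f"State:\n{state}\n\nIs this check satisfied: {step['check']}? Answer yes or no.")
              if not verdict.strip().lower().startswith("yes"):
                  return False, f"step {step['id']} failed its check: {step['check']}"
          return True, state

      def solve(task: str, llm, max_rounds: int = 3):
          feedback = ""
          for _ in range(max_rounds):
              steps = synthesize_scaffold(task + feedback, llm)
              ok, result = run_scaffold(steps, llm)
              if ok:
                  return result
              # Feed the failure back so the planner can refine the scaffold.
              feedback = f"\n\nA previous plan failed ({result}); produce a corrected plan."
          return None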

LLM weaknesses: memory, spatial reasoning, and vision

  • Empirical reports: models perform badly on Sokoban-like puzzles, nonograms, mazes, and ARC-style tasks—forgetting rules they previously derived and repeating disproven deductions.
  • Some attribute this mainly to poor long-range memory and reliance on lossy text context; others stress weak spatial/visual reasoning and current “bag-of-vision-tokens” frontends.
  • There is debate over whether vision or memory is the primary blocker; multiple commenters insist models need compact internal, non-verbal representations of rules and state (one reading of that idea is illustrated after this list).
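
  One concrete reading of that idea, sketched as an explicit state object kept outside the context window (purely illustrative):

      from dataclasses import dataclass, field

      @dataclass
      class PuzzleState:
          # Compact, queryable state: derived rules and refuted hypotheses
          # persist here instead of scrolling away as lossy text context.
          grid: list[list[int]]
          rules: dict[str, str] = field(default_factory=dict)
          refuted: set[str] = field(default_factory=set)

          def record_rule(self, name: str, rule: str) -> None:
              self.rules[name] = rule

          def refute(self, hypothesis: str) -> None:
              self.refuted.add(hypothesis)

          def is_refuted(self, hypothesis: str) -> bool:
              # Guards against the failure mode reported above:
              # re-proposing a deduction that was already disproven.
              return hypothesis in self.refuted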

ARC-AGI’s role and modality issues

  • Several see ARC-AGI as primarily a visual benchmark on which humans benefit from strong innate visual preprocessing; if the puzzles were handed to people as raw JSON, most would first transform them into graphics (see the rendering sketch after this list).
  • Others note that strong computer-vision modules exist but haven’t yet produced very high ARC-AGI scores when bolted onto LLMs.
  • Some view this work as meaningful progress on one of the few benchmarks where humans still dominate; others think it’s “slightly smarter brute force” or overfitting to a contrived task.
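
  The modality point is easy to make concrete: an ARC task arrives as nested integer lists, and humans typically want them painted before reasoning. A toy rendering sketch (color choices arbitrary):

      import json

      # One ANSI background color per ARC cell value 0-9.
      COLORS = [40, 44, 41, 42, 43, 45, 46, 47, 100, 101]

      def show(grid: list[list[int]]) -> None:
          for row in grid:
              print("".join(f"\033[{COLORS[v]}m  \033[0m" for v in row))

      task = json.loads('{"train": [{"input": [[0, 1, 2], [2, 1, 0]]}]}')  # toy stand-in
      show(task["train"][0]["input"])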

Reasoning vs pattern matching and “PhD-level” claims

  • Long subthread debates whether LLMs genuinely “reason” or just perform sophisticated pattern matching.
  • One side argues that high benchmark scores, commonsense examples, and mech-interp findings (latent world models, abstract circuits) imply reasoning functionally similar to humans’, albeit text- and 1D-biased.
  • The opposing side stresses failures on simple puzzles, out-of-domain tasks, lack of runtime learning, and reliance on offline RL as signs they are closer to expert systems trained to the test.
  • Definitions are contested: some equate reasoning with advanced pattern matching; others insist true human-like reasoning must include continual learning and generalization to genuinely novel problems.

Dead zones, RL, and learning over time

  • The article’s notion of “dead reasoning zones” is challenged; critics say humans do exhibit systematic reasoning failures, especially in abductive inference or under cognitive dissonance.
  • Questions are raised about the claim that RL “forces logical consistency”; skeptics note that repeated trial-and-error with an oracle differs from humans’ one-shot reasoning and self-checking.
  • Several point out that LLMs could, in principle, approximate runtime learning via external memory plus periodic fine-tuning on their own experience, but this is not how today’s models generally operate (a sketch of the idea follows this list).
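
  A hedged sketch of that proposal, with hypothetical file names and thresholds: log every episode to external memory, then periodically distill verified successes into a fine-tuning set:

      import json

      MEMORY = "experience.jsonl"
      MIN_EPISODES = 1000

      def record(prompt: str, answer: str, reward: float) -> None:
          # External memory: append every episode as it happens.
          with open(MEMORY, "a") as f:
              f.write(json.dumps({"prompt": prompt, "answer": answer, "reward": reward}) + "\n")

      def export_finetune_set(min_reward: float = 1.0) -> str | None:
          # Periodic consolidation: once enough experience has accumulated,
          # turn verified successes into an OpenAI-style chat JSONL file.
          with open(MEMORY) as f:
              episodes = [json.loads(line) for line in f]
          if len(episodes) < MIN_EPISODES:
              return None
          with open("finetune.jsonl", "w") as f:
              for ep in episodes:
                  if ep["reward"] >= min_reward:
                      f.write(json.dumps({"messages": [
                          {"role": "user", "content": ep["prompt"]},
                          {"role": "assistant", "content": ep["answer"]},
                      ]}) + "\n")
          return "finetune.jsonl"  # hand this to a periodic fine-tuning job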

Practical tools, reproducibility, and evaluation

  • Commenters share related frameworks (e.g., DSPy, GEPA-like approaches) and ask for reusable tools to run evolutionary prompt/program search at home against major APIs (a minimal example follows this list).
  • Links to the project’s GitHub and Kaggle notebooks are provided for replication.
  • Some worry that apparent improvements on public puzzles might just reflect training on blog posts or leaked solutions; others suggest controlled tests with pre‑ARC models and ablations of the new method.
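
  A minimal DSPy example of the kind of at-home setup being asked for (model id and field names are assumptions; the metric mirrors ARC-style exact-match scoring):

      import dspy

      dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any LiteLLM-style model id

      solver = dspy.ChainOfThought("grid_task -> solution")  # field names are illustrative

      def exact_match(gold, pred, trace=None):
          return pred.solution.strip() == gold.solution.strip()

      trainset = [
          dspy.Example(grid_task="<serialized ARC grid>", solution="<expected grid>")
              .with_inputs("grid_task"),
      ]

      # BootstrapFewShot searches over few-shot demonstrations; recent DSPy
      # releases also ship a GEPA optimizer with a similar compile() interface.
      optimized = dspy.BootstrapFewShot(metric=exact_match).compile(solver, trainset=trainset)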