ARC-AGI-3

Overall Reaction & Game Difficulty

  • Many commenters tried the demo tasks; reactions range from “intuitive and fun” to “I have no idea what to do” or “controls are janky/laggy.”
  • Prior gaming and puzzle experience strongly correlates with success; some lifelong gamers find the levels trivial, while others struggle even to infer the rules.
  • Several note this is not an IQ test but a test of rule inference, spatial reasoning, and adapting to the “style” of these puzzles.

What’s New vs ARC-AGI-1/2

  • ARC-AGI-1 and -2 were static pattern-completion tasks; ARC-AGI-3 is interactive and multi-step.
  • New dimensions: multi-turn planning, exploration/exploitation, agentic behavior, cross-level transfer, and spatial reasoning under changing world state.
  • Some see this as a natural evolution to keep pressure on models as earlier benchmarks get saturated.

Scoring, Human Baseline & Interpretation

  • The score is not the percentage of puzzles solved but a squared efficiency ratio against the second-best human's action count, with later, harder levels weighted more heavily.
  • Even humans who solve many levels can score in the single digits if they use more steps than top solvers; several note the median human score may be well below 50%.
  • Frontier models mostly score ~0–3%; some argue this looks worse than it is, others say the opaque score is misleading.
  • Benchmark authors defend the design as discouraging brute force and rewarding sample-efficient rule learning.
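The squared-efficiency weighting described above can be illustrated with a small sketch. This is an assumption-laden toy model, not the official formula: `level_score`, `overall_score`, the weights, and the exact capping behavior are all hypothetical, chosen only to show why solving every level with extra actions can still yield a single-digit score.

```python
def level_score(agent_actions: int, reference_actions: int) -> float:
    """Hypothetical per-level score: squared efficiency relative to a
    human reference action count, capped at 1.0 (beating the reference
    earns no bonus)."""
    return min(1.0, reference_actions / agent_actions) ** 2

def overall_score(levels: list[tuple[int, int, float]]) -> float:
    """Weighted average of per-level scores; later/harder levels carry
    larger weights, as the benchmark description suggests."""
    total_weight = sum(w for _, _, w in levels)
    return sum(level_score(a, r) * w for a, r, w in levels) / total_weight

# An agent that solves both levels, but takes 4x the reference actions
# on level 1 and 10x on the more heavily weighted level 2:
levels = [(40, 10, 1.0), (200, 20, 3.0)]
print(round(overall_score(levels) * 100, 1))  # 2.3 — single digits despite "solving" everything
```

Under this toy weighting, inefficiency is punished quadratically, which is why brute-force or exploratory play collapses the score even when every level is eventually cleared.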

Harnesses, Tools, and Inputs

  • Official runs disallow ARC-specific harnesses; models get a simple prompt and JSON grids, though they may have hidden tools behind the API.
  • Some argue denying LLMs vision while humans get a GUI is unfair; others reply that an AGI should handle arbitrary encodings or build its own visualizer.
  • Debate over whether generic tools (e.g., Python, GUIs) should be allowed and how to detect “benchmaxxing.”
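The "build its own visualizer" argument can be made concrete with a minimal sketch. The payload shape below is hypothetical (the source only says models receive grids as JSON, not the exact schema); the point is that a text-only agent can trivially re-render such input into a spatially legible form.

```python
import json

# Hypothetical payload shape — the official API schema is not specified
# in the discussion, only that inputs arrive as JSON grids.
payload = json.dumps({"grid": [[0, 0, 3], [0, 3, 0], [3, 0, 0]]})

def render(grid_json: str) -> str:
    """Crude ASCII visualizer an agent could build for itself:
    map empty cells to '.' and nonzero cells to their value, so the
    spatial structure humans see in the GUI becomes visible in text."""
    grid = json.loads(grid_json)["grid"]
    return "\n".join(
        "".join("." if cell == 0 else str(cell) for cell in row)
        for row in grid
    )

print(render(payload))
# ..3
# .3.
# 3..
```

Whether such self-built tooling counts as a fair harness or as "benchmaxxing" is exactly the dispute summarized above.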

Relation to AGI & Definitions

  • One camp: any task humans find easy and models find hard is valuable; when no such tasks remain, we effectively have AGI.
  • Another camp: game competence is at best a necessary condition; AGI entails broad real-world capabilities, human-like learning efficiency, or human-like interaction, not just puzzle scores.
  • Some argue the bar is creeping upward (ARC-AGI-1 → 2 → 3 → …) and that “AGI” here is largely a moving marketing label.

Usefulness, Generalization & Economics

  • Supporters see this as a productive adversarial benchmark to drive better generalization and agentic reasoning, not just Q&A.
  • Skeptics say models will eventually just be pre-trained on similar games, turning this into another narrow benchmark.
  • Broader worries surface about economic impact once the “learning gap” to humans closes, though others argue economic disruption will happen even without formal AGI.

Enthusiasm vs Skepticism

  • Enthusiasts: call it a “good and clever” benchmark, fun to play, and likely to push models toward more useful planning and reasoning.
  • Skeptics: question its conceptual link to AGI, the fairness of inputs, the scoring complexity, and the recurring pattern of “new unsolved ARC-AGI → quickly solved → new version.”