ARC-AGI-3
Overall Reaction & Game Difficulty
- Many commenters tried the demo tasks; reactions range from “intuitive and fun” to “I have no idea what to do,” with several complaining the controls are janky or laggy.
- Prior gaming and puzzle experience strongly correlates with success; some lifelong gamers find the levels trivial, while others struggle even to infer the rules.
- Several note this is not an IQ test but a test of rule inference, spatial reasoning, and adapting to the “style” of these puzzles.
What’s New vs ARC-AGI-1/2
- v1 and v2 were static pattern-completion tasks; v3 is interactive and multi-step (a toy sketch of the resulting agent loop follows this list).
- New dimensions: multi-turn planning, exploration/exploitation, agentic behavior, cross-level transfer, and spatial reasoning under changing world state.
- Some see this as a natural evolution to keep pressure on models as earlier benchmarks get saturated.
Scoring, Human Baseline & Interpretation
- Score is not “% of puzzles solved” but a squared efficiency ratio against the second-best human action count, with later/harder levels weighted more (a hedged sketch of this scoring follows this list).
- Even humans who solve many levels but use more steps than top solvers can score in the single digits; several note the median human might land well below 50%.
- Frontier models mostly score ~0–3%; some argue this looks worse than it is, while others say the opaque score is misleading.
- Benchmark authors defend the design as discouraging brute force and rewarding sample-efficient rule learning.
Harnesses, Tools, and Inputs
- Official runs disallow ARC-specific harnesses; models get a simple prompt plus JSON grids (an illustrative input is sketched after this list), though they may have hidden tools behind the API.
- Some argue denying LLMs vision while humans get a GUI is unfair; others reply that an AGI should handle arbitrary encodings or build its own visualizer.
- Debate over whether generic tools (e.g., Python, GUIs) should be allowed and how to detect “benchmaxxing.”
Relation to AGI & Definitions
- One camp: any task humans find easy and models find hard is valuable; when no such tasks remain, we effectively have AGI.
- Another camp: game competence is at best a necessary condition; AGI entails broad real-world capabilities, human-like learning efficiency, or human-like interaction, not just puzzle scores.
- Some argue the bar is creeping upward (ARC-AGI-1 → 2 → 3 → …) and that “AGI” here is largely a moving marketing label.
Usefulness, Generalization & Economics
- Supporters see this as a productive adversarial benchmark to drive better generalization and agentic reasoning, not just Q&A.
- Skeptics say models will eventually just be pre-trained on similar games, turning this into another narrow benchmark.
- Broader worries surface about economic impact once the “learning gap” to humans closes, though others argue economic disruption will happen even without formal AGI.
Enthusiasm vs Skepticism
- Enthusiasts: call it a “good and clever” benchmark, fun to play, and likely to push models toward more useful planning and reasoning.
- Skeptics: question its conceptual link to AGI, the fairness of inputs, the scoring complexity, and the recurring pattern of “new unsolved ARC-AGI → quickly solved → new version.”