ARC-AGI-3

Overall Reaction & Game Difficulty

  • Many commenters tried the demo tasks; reactions range from “intuitive and fun” to “I have no idea what to do” or “controls are janky/laggy.”
  • Prior gaming and puzzle experience strongly correlates with success; some lifelong gamers find the levels trivial, while others struggle even to infer the rules.
  • Several note this is not an IQ test but a test of rule inference, spatial reasoning, and adapting to the “style” of these puzzles.

What’s New vs ARC-AGI-1/2

  • ARC-AGI-1 and -2 were static pattern-completion tasks; ARC-AGI-3 is interactive and multi-step.
  • New dimensions: multi-turn planning, exploration/exploitation, agentic behavior, cross-level transfer, and spatial reasoning under changing world state.
  • Some see this as a natural evolution to keep pressure on models as earlier benchmarks get saturated.

Scoring, Human Baseline & Interpretation

  • The score is not the percentage of puzzles solved but a squared efficiency ratio against the second-best human's action count, with later, harder levels weighted more heavily.
  • Even humans who solve many levels can score in the single digits if they use more steps than top solvers; several note the median human score may be well below 50%.
  • Frontier models mostly score ~0–3%; some argue this looks worse than it is, others say the opaque score is misleading.
  • Benchmark authors defend the design as discouraging brute force and rewarding sample-efficient rule learning.
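The squared-efficiency weighting described above can be illustrated with a small sketch. This is an assumption-laden toy model, not the official formula: `level_score`, `overall_score`, the weights, and the exact capping behavior are all hypothetical, chosen only to show why solving every level with extra actions can still yield a single-digit score.

```python
def level_score(agent_actions: int, reference_actions: int) -> float:
    """Hypothetical per-level score: squared efficiency relative to a
    human reference action count, capped at 1.0 (beating the reference
    earns no bonus)."""
    return min(1.0, reference_actions / agent_actions) ** 2

def overall_score(levels: list[tuple[int, int, float]]) -> float:
    """Weighted average of per-level scores; later/harder levels carry
    larger weights, as the benchmark description suggests."""
    total_weight = sum(w for _, _, w in levels)
    return sum(level_score(a, r) * w for a, r, w in levels) / total_weight

# An agent that solves both levels, but takes 4x the reference actions
# on level 1 and 10x on the more heavily weighted level 2:
levels = [(40, 10, 1.0), (200, 20, 3.0)]
print(round(overall_score(levels) * 100, 1))  # 2.3 — single digits despite "solving" everything
```

Under this toy weighting, inefficiency is punished quadratically, which is why brute-force or exploratory play collapses the score even when every level is eventually cleared.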

Harnesses, Tools, and Inputs

  • Official runs disallow ARC-specific harnesses; models get a simple prompt and JSON grids, though they may have hidden tools behind the API.
  • Some argue denying LLMs vision while humans get a GUI is unfair; others reply that an AGI should handle arbitrary encodings or build its own visualizer.
  • Debate over whether generic tools (e.g., Python, GUIs) should be allowed and how to detect “benchmaxxing.”
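The "build its own visualizer" argument can be made concrete with a minimal sketch. The payload shape below is hypothetical (the source only says models receive grids as JSON, not the exact schema); the point is that a text-only agent can trivially re-render such input into a spatially legible form.

```python
import json

# Hypothetical payload shape — the official API schema is not specified
# in the discussion, only that inputs arrive as JSON grids.
payload = json.dumps({"grid": [[0, 0, 3], [0, 3, 0], [3, 0, 0]]})

def render(grid_json: str) -> str:
    """Crude ASCII visualizer an agent could build for itself:
    map empty cells to '.' and nonzero cells to their value, so the
    spatial structure humans see in the GUI becomes visible in text."""
    grid = json.loads(grid_json)["grid"]
    return "\n".join(
        "".join("." if cell == 0 else str(cell) for cell in row)
        for row in grid
    )

print(render(payload))
# ..3
# .3.
# 3..
```

Whether such self-built tooling counts as a fair harness or as "benchmaxxing" is exactly the dispute summarized above.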

Relation to AGI & Definitions

  • One camp: any task humans find easy and models find hard is valuable; when no such tasks remain, we effectively have AGI.
  • Another camp: game competence is at best a necessary condition; AGI entails broad real-world capabilities, human-like learning efficiency, or human-like interaction, not just puzzle scores.
  • Some argue the bar is creeping upward (ARC-AGI-1 → 2 → 3 → …) and that “AGI” here is largely a moving marketing label.

Usefulness, Generalization & Economics

  • Supporters see this as a productive adversarial benchmark to drive better generalization and agentic reasoning, not just Q&A.
  • Skeptics say models will eventually just be pre-trained on similar games, turning this into another narrow benchmark.
  • Broader worries surface about economic impact once the “learning gap” to humans closes, though others argue economic disruption will happen even without formal AGI.

Enthusiasm vs Skepticism

  • Enthusiasts: call it a “good and clever” benchmark, fun to play, and likely to push models toward more useful planning and reasoning.
  • Skeptics: question its conceptual link to AGI, the fairness of inputs, the scoring complexity, and the recurring pattern of “new unsolved ARC-AGI → quickly solved → new version.”