OpenAI o1 Results on ARC-AGI-Pub

ARC-AGI as an AGI Benchmark

  • Many see ARC-AGI as one of the strongest existing AGI benchmarks: tasks are designed to resist memorized priors, each provides only a few demonstration examples, and the hidden test set is withheld to limit overfitting.
  • Others argue it’s a distraction: success could come from ARC-specific data or strategies rather than general intelligence, as happened with earlier benchmarks (e.g., Winograd-style tasks).
  • Several commenters call solving ARC “necessary but not sufficient” for AGI: a genuine AGI should do well on ARC, but a system specialized for ARC would not demonstrate AGI.

Results, Performance, and Compute

  • Reported scores on the ARC-AGI public evaluation set: GPT‑4o 9%; o1‑preview and Claude 3.5 Sonnet both ~21%; MindsAI’s ARC-specialized system ~46%; and Ryan Greenblatt’s GPT‑4o program-generation setup ~42%.
  • Key takeaway: o1‑preview roughly matches Sonnet on accuracy but is vastly slower: the 400 public tasks take ~70 hours versus ~30 minutes for GPT‑4o or Sonnet, implying far more inference-time compute (see the quick arithmetic after this list).
  • Reactions split: some see this as a significant step up from GPT‑4o; others say it’s “not that great” given the cost.
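To make the compute gap concrete, here is the back-of-the-envelope arithmetic behind that takeaway, using only the figures quoted above:

```python
# Per-task latency implied by the reported wall-clock times
# (400 public tasks; ~70 h for o1-preview, ~30 min for GPT-4o/Sonnet).
TASKS = 400

o1_seconds = 70 * 3600            # ~70 hours for the full set
baseline_seconds = 30 * 60        # ~30 minutes for the full set

per_task_o1 = o1_seconds / TASKS              # 630 s (~10.5 min/task)
per_task_baseline = baseline_seconds / TASKS  # 4.5 s/task

print(f"o1-preview:    {per_task_o1:.0f} s/task")
print(f"GPT-4o/Sonnet: {per_task_baseline:.1f} s/task")
print(f"slowdown:      ~{per_task_o1 / per_task_baseline:.0f}x")  # ~140x
```

At roughly equal accuracy to Sonnet, that is about a 140x latency penalty, which is what commenters mean by “much higher inference-time compute.”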

Nature of o1: Memorizing Reasoning

  • Discussion centers on the idea that o1 “memorizes reasoning patterns” through additional RL-style training rather than achieving fundamentally new generalization.
  • Commenters expect hallucinations, and failures on genuinely novel or complex problems, to persist.
  • OpenAI’s hiding of o1’s “reasoning tokens” is read by some as a way to obscure how much of the chain of thought is memorized pattern-matching.

Scaling, Multimodality, and Transformers

  • One camp is optimistic: natively multimodal models and synthetic-data/distillation pipelines are seen as major untapped levers, with no clear plateau yet.
  • Another camp points to accuracy growing only with the logarithm of compute (e.g., the AIME curves) and to semiconductor limits, predicting diminishing returns without new architectures (see the sketch after this list).
  • Test-time scaling (spending more compute per query) is noted as important but costly.
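A minimal sketch (not from the thread) of what “log-scale gains” implies: if accuracy grows linearly in log(compute), each fixed gain in accuracy costs a multiplicative increase in compute. The coefficients below are invented for illustration, not fitted to any published curve:

```python
import math

# Toy log-linear scaling law: accuracy = a + b * log10(compute).
# a and b are hypothetical; a real curve (e.g., the AIME plot) would
# need fitting to published data.
a, b = 0.10, 0.15

def accuracy(compute: float) -> float:
    """Accuracy under the toy law, clamped to [0, 1]."""
    return min(1.0, max(0.0, a + b * math.log10(compute)))

# Every 10x in compute buys the same fixed increment (+15 points here),
# which is why "diminishing returns" dominates at the high end.
for compute in (1e2, 1e3, 1e4, 1e5):
    print(f"compute={compute:.0e}  accuracy={accuracy(compute):.2f}")
```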

Critiques of ARC Design and Interpretation

  • Some say ARC mostly tests visual/spatial pattern recognition and sample efficiency rather than “intelligence,” and is unfairly hostile to data-hungry deep nets.
  • Others reply that the low data per task is exactly the point: it mirrors human few-shot abstraction and probes genuine generalization (see the task-format sketch after this list).
  • There is debate over whether tasks are under-specified (admitting many valid continuations) and whether they reduce to “guess how the puzzle author thinks.”
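For context on what “low data per task” means: each public ARC task is a small JSON object with a handful of demonstration input/output grids (integers 0–9, one per cell) and one or more test inputs, as in the public fchollet/ARC repository. A minimal loader, with a hypothetical filename:

```python
import json

# An ARC task has "train" and "test" lists of {"input": grid, "output": grid}
# pairs; a grid is a list of rows of integers 0-9. Filename is illustrative.
with open("data/training/some_task.json") as f:
    task = json.load(f)

print(f"{len(task['train'])} demonstration pair(s)")  # typically 2-5
for pair in task["train"]:
    gi, go = pair["input"], pair["output"]
    print(f"  {len(gi)}x{len(gi[0])} -> {len(go)}x{len(go[0])}")

# A solver sees only those few pairs and must produce the output grid for
# each test input -- the "few-shot abstraction" defenders point to.
test_input = task["test"][0]["input"]
```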

Broader Intelligence/Philosophy Debates

  • The thread digresses into whether intelligence becomes “trivial” given enough compute (e.g., brute-force simulation of a human brain) and how realistic that is.
  • A long subthread argues about undecidable problems, whether humans can “identify” them in ways Turing machines cannot, and what that implies for testing machine intelligence (the classic construction is sketched below).
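For readers unfamiliar with the undecidability result driving that subthread, the standard textbook construction (not something proposed in the thread) fits in a few lines: if a perfect halting decider existed, the program below would halt iff it doesn’t halt.

```python
def halts(program, argument) -> bool:
    """Hypothetical perfect decider: True iff program(argument) halts.
    No total, always-correct implementation can exist."""
    raise NotImplementedError

def paradox(program):
    # Do the opposite of whatever halts() predicts about running
    # `program` on its own source.
    if halts(program, program):
        while True:      # predicted to halt -> loop forever
            pass
    return               # predicted to loop -> halt immediately

# paradox(paradox) halts iff it doesn't halt: contradiction, so halts()
# cannot exist. This is what "undecidable" means in the subthread.
```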

OpenAI vs Anthropic and Practical Notes

  • Several see these results as highlighting Anthropic’s lead, especially on reasoning tasks, and criticize OpenAI’s recent direction and hype.
  • Others counter that OpenAI’s multimodal demos and its “advanced voice mode” remain compelling.
  • Users ask practical questions about using o1 on real codebases and whether humans can try ARC tasks themselves; links to the ARC site and tooling are shared.