OpenAI o1 Results on ARC-AGI-Pub

ARC-AGI as an AGI Benchmark

  • Many see ARC-AGI as one of the strongest existing AGI benchmarks: tasks are designed to resist memorized priors, each provides only a few demonstration examples, and the hidden test set is withheld to limit overfitting.
  • Others argue it’s a distraction: success could come from ARC-specific data or strategies rather than general intelligence, as happened with earlier benchmarks (e.g., Winograd-style tasks).
  • Several commenters call solving ARC “necessary but not sufficient” for AGI: a genuine AGI should do well on ARC, but a system specialized for ARC would not demonstrate AGI.

Results, Performance, and Compute

  • Reported scores on the ARC-AGI public evaluation set: GPT‑4o 9%; o1‑preview and Claude 3.5 Sonnet both ~21%; MindsAI’s ARC-specialized system ~46%; and Ryan Greenblatt’s GPT‑4o program-generation setup ~42%.
  • Key takeaway: o1‑preview roughly matches Sonnet on accuracy but is vastly slower: the 400 public tasks take ~70 hours versus ~30 minutes for GPT‑4o or Sonnet, implying far more inference-time compute (see the quick arithmetic after this list).
  • Reactions split: some see this as a significant step up from GPT‑4o; others say it’s “not that great” given the cost.
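To make the compute gap concrete, here is the back-of-the-envelope arithmetic behind that takeaway, using only the figures quoted above:

```python
# Per-task latency implied by the reported wall-clock times
# (400 public tasks; ~70 h for o1-preview, ~30 min for GPT-4o/Sonnet).
TASKS = 400

o1_seconds = 70 * 3600            # ~70 hours for the full set
baseline_seconds = 30 * 60        # ~30 minutes for the full set

per_task_o1 = o1_seconds / TASKS              # 630 s (~10.5 min/task)
per_task_baseline = baseline_seconds / TASKS  # 4.5 s/task

print(f"o1-preview:    {per_task_o1:.0f} s/task")
print(f"GPT-4o/Sonnet: {per_task_baseline:.1f} s/task")
print(f"slowdown:      ~{per_task_o1 / per_task_baseline:.0f}x")  # ~140x
```

At roughly equal accuracy to Sonnet, that is about a 140x latency penalty, which is what commenters mean by “much higher inference-time compute.”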

Nature of o1: Memorizing Reasoning

  • Discussion centers on the idea that o1 “memorizes reasoning patterns” through additional RL-style training rather than achieving fundamentally new generalization.
  • Commenters expect hallucinations, and failures on genuinely novel or complex problems, to persist.
  • OpenAI’s hiding of o1’s “reasoning tokens” is read by some as a way to obscure how much of the chain of thought is memorized pattern-matching.

Scaling, Multimodality, and Transformers

  • One camp is optimistic: natively multimodal models and synthetic-data/distillation pipelines are seen as major untapped levers, with no clear plateau yet.
  • Another camp points to accuracy growing only with the logarithm of compute (e.g., the AIME curves) and to semiconductor limits, predicting diminishing returns without new architectures (see the sketch after this list).
  • Test-time scaling (spending more compute per query) is noted as important but costly.
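A minimal sketch (not from the thread) of what “log-scale gains” implies: if accuracy grows linearly in log(compute), each fixed gain in accuracy costs a multiplicative increase in compute. The coefficients below are invented for illustration, not fitted to any published curve:

```python
import math

# Toy log-linear scaling law: accuracy = a + b * log10(compute).
# a and b are hypothetical; a real curve (e.g., the AIME plot) would
# need fitting to published data.
a, b = 0.10, 0.15

def accuracy(compute: float) -> float:
    """Accuracy under the toy law, clamped to [0, 1]."""
    return min(1.0, max(0.0, a + b * math.log10(compute)))

# Every 10x in compute buys the same fixed increment (+15 points here),
# which is why "diminishing returns" dominates at the high end.
for compute in (1e2, 1e3, 1e4, 1e5):
    print(f"compute={compute:.0e}  accuracy={accuracy(compute):.2f}")
```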

Critiques of ARC Design and Interpretation

  • Some say ARC mostly tests visual/spatial pattern recognition and sample efficiency rather than “intelligence,” and is unfairly hostile to data-hungry deep nets.
  • Others reply that the low data per task is exactly the point: it mirrors human few-shot abstraction and probes genuine generalization (see the task-format sketch after this list).
  • There is debate over whether tasks are under-specified (admitting many valid continuations) and whether they reduce to “guess how the puzzle author thinks.”
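For context on what “low data per task” means: each public ARC task is a small JSON object with a handful of demonstration input/output grids (integers 0–9, one per cell) and one or more test inputs, as in the public fchollet/ARC repository. A minimal loader, with a hypothetical filename:

```python
import json

# An ARC task has "train" and "test" lists of {"input": grid, "output": grid}
# pairs; a grid is a list of rows of integers 0-9. Filename is illustrative.
with open("data/training/some_task.json") as f:
    task = json.load(f)

print(f"{len(task['train'])} demonstration pair(s)")  # typically 2-5
for pair in task["train"]:
    gi, go = pair["input"], pair["output"]
    print(f"  {len(gi)}x{len(gi[0])} -> {len(go)}x{len(go[0])}")

# A solver sees only those few pairs and must produce the output grid for
# each test input -- the "few-shot abstraction" defenders point to.
test_input = task["test"][0]["input"]
```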

Broader Intelligence/Philosophy Debates

  • The thread digresses into whether intelligence becomes “trivial” given enough compute (e.g., brute-force simulation of a human brain) and how realistic that is.
  • A long subthread argues about undecidable problems, whether humans can “identify” them in ways Turing machines cannot, and what that implies for testing machine intelligence (the classic construction is sketched below).
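For readers unfamiliar with the undecidability result driving that subthread, the standard textbook construction (not something proposed in the thread) fits in a few lines: if a perfect halting decider existed, the program below would halt iff it doesn’t halt.

```python
def halts(program, argument) -> bool:
    """Hypothetical perfect decider: True iff program(argument) halts.
    No total, always-correct implementation can exist."""
    raise NotImplementedError

def paradox(program):
    # Do the opposite of whatever halts() predicts about running
    # `program` on its own source.
    if halts(program, program):
        while True:      # predicted to halt -> loop forever
            pass
    return               # predicted to loop -> halt immediately

# paradox(paradox) halts iff it doesn't halt: contradiction, so halts()
# cannot exist. This is what "undecidable" means in the subthread.
```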

OpenAI vs Anthropic and Practical Notes

  • Several see these results as highlighting Anthropic’s lead, especially on reasoning tasks, and criticize OpenAI’s recent direction and hype.
  • Others counter that OpenAI’s multimodal demos and its “advanced voice mode” remain compelling.
  • Users ask practical questions about using o1 on real codebases and whether humans can try ARC tasks themselves; links to the ARC site and tooling are shared.