OpenAI O3 breakthrough high score on ARC-AGI-PUB

ARC-AGI as an AGI Test & “Goalpost Moving”

  • Many point out that ARC-AGI was explicitly designed as a necessary-but-not-sufficient test of general intelligence; beating it therefore does not imply AGI.
  • Some argue this is classic “goalpost moving” (like chess/Go/Turing test before): once a benchmark falls, skeptics say it was never a good AGI proxy.
  • Others counter that “general intelligence” is not binary: today’s LLMs are already general in many practical senses, yet far from the sci‑fi notion of an intelligence that radically transforms GDP, employment, or daily life.

What o3 Seems to Do Differently

  • o3 uses heavy test‑time compute: long chains-of-thought, multiple samples per task, and likely some kind of tree search over candidate solutions (a minimal sketch of the sampling idea follows this list).
  • People liken this to chess engines increasing search depth: same underlying model, more inference compute, better results.
  • Some see this as confirmation that scaling inference-time search is a viable path forward after pre‑training gains started to flatten.
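The internals of o3 are not public, so the following is only a minimal sketch of the general recipe described above: sample many independent chains of thought from the same model and keep the consensus answer (self-consistency), so accuracy climbs as inference compute grows. The `sample_answer` callable is a hypothetical stand-in for a real model call, not any actual OpenAI API.

```python
# Minimal sketch of test-time compute scaling via repeated sampling plus
# majority voting (self-consistency). o3's real mechanism is not public
# and likely layers search and learned verification on top of this idea.
import random
from collections import Counter
from typing import Callable

def solve_with_test_time_compute(
    task: str,
    sample_answer: Callable[[str], str],  # hypothetical model call
    num_samples: int = 64,
) -> str:
    """Draw many independent answers and return the most common one.

    More samples means more inference compute and, empirically, higher
    accuracy: the thread's analogy is a chess engine searching deeper
    with the same evaluation function.
    """
    answers = [sample_answer(task) for _ in range(num_samples)]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer

# Usage with a toy stochastic "model" in place of a real one: it answers
# correctly 75% of the time, so majority voting is almost always right.
toy_model = lambda task: random.choice(["42", "42", "42", "7"])
print(solve_with_test_time_compute("example ARC task", toy_model, num_samples=25))
```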

Benchmark Validity, Training, and Overfitting

  • ARC-AGI v1 is now considered “saturated”: ensembles of hand‑engineered Kaggle solutions already reach ~81%.
  • o3’s reported result used a version trained on the public ARC training set; several posters say this makes direct comparisons to other models less clean.
  • The upcoming ARC-AGI v2 is expected to be much harder for o3 (early signs point to scores below ~30% even at high compute) while remaining easy for humans, suggesting plenty of headroom.
  • Broader concern: any public benchmark eventually gets “gamed” as models and teams optimize specifically for it.

Costs, Efficiency, and Scaling

  • Low-compute o3 reportedly costs ~$17–20 per ARC task; high-compute mode uses ~172× more compute, implying thousands of dollars per task and hundreds of thousands for the full eval (see the back-of-the-envelope arithmetic after this list).
  • Some see this as brute-force and uneconomic; others stress that early, expensive demonstrations often precede rapid efficiency gains (training and inference).
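For concreteness, the arithmetic behind those figures, assuming cost scales linearly with compute; the per-task price and the ~172× multiplier are the numbers reported in the thread, while the 100-task evaluation size is an assumption:

```python
# Back-of-the-envelope cost arithmetic for the figures quoted above.
# Assumes cost scales linearly with compute; the eval size is an
# assumption, not a reported number.
low_cost_per_task = 20.0   # ~$17-20 per task in low-compute mode
compute_multiplier = 172   # reported high- vs low-compute ratio
num_tasks = 100            # assumed size of the evaluation set

high_cost_per_task = low_cost_per_task * compute_multiplier   # ~$3,440
full_eval_cost = high_cost_per_task * num_tasks               # ~$344,000

print(f"high-compute mode: ~${high_cost_per_task:,.0f} per task")
print(f"full evaluation:   ~${full_eval_cost:,.0f} total")
```

At roughly $3,440 per task and $344,000 for a hundred tasks, that lands squarely in the “thousands per task, hundreds of thousands per eval” range quoted above.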

Human vs Model Performance

  • On ARC-AGI v1, high-compute o3 slightly exceeds the average “STEM grad” threshold and significantly beats average Mechanical Turk workers, but still fails various tasks humans find “trivially easy.”
  • On other benchmarks (SWE-bench, FrontierMath), o3 makes large jumps but still falls far short of expert human teams and full reliability.

Alternative Benchmarks & Remaining Weaknesses

  • Posters highlight other “easy for humans, hard for LLMs” suites: NovelQA, SimpleBench, GSM-Symbolic, function-calling benchmarks, physical reasoning (PIQA), and hallucination tests.
  • LLMs remain fragile under red herrings, minor rephrasings, long contexts, and tasks requiring world knowledge, memory over days, or rich spatial/motor grounding (a GSM-Symbolic-style perturbation probe is sketched after this list).
  • Many emphasize that real-world performance—e.g., writing and maintaining complex software systems—still lags far behind what benchmark headlines imply.
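As an illustration of the perturbation-style probes above, here is a minimal sketch in the spirit of GSM-Symbolic: turn one word problem into a template, resample names and numbers, and optionally inject an irrelevant “red herring” clause that should not change the answer; a fragile model's accuracy drops across such variants. The problem template is invented for illustration and is not an item from the real benchmark.

```python
# Sketch of a GSM-Symbolic-style robustness probe: generate many surface
# variants of one problem whose ground-truth answer is trivially known,
# then check whether a model's answers stay consistent across them.
import random

TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday."
            "{noop} How many apples does {name} have?")
RED_HERRING = " Five of the apples are slightly smaller than average."  # irrelevant

def make_variant(rng: random.Random, with_red_herring: bool) -> tuple[str, int]:
    """Instantiate the template with fresh names and numbers. The correct
    answer (a + b) is unaffected by the red-herring clause."""
    a, b = rng.randint(2, 90), rng.randint(2, 90)
    name = rng.choice(["Mia", "Ravi", "Lena", "Kofi"])
    noop = RED_HERRING if with_red_herring else ""
    return TEMPLATE.format(name=name, a=a, b=b, noop=noop), a + b

rng = random.Random(0)
for flag in (False, True):
    question, answer = make_variant(rng, with_red_herring=flag)
    print(question, "->", answer)
```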

Economic and Social Implications

  • Strong anxiety among software engineers: concern that junior roles may disappear and that modest productivity boosts could shrink total headcount.
  • Others argue product complexity will grow, AI will act as a force multiplier, and demand for high-end engineers and new roles (coordination, supervision, domain expertise) will persist or increase.
  • Broader debates about middle-class erosion, UBI, concentration of power in AI-owning capital, and whether society will adapt equitably or repeat past “jobless recovery” patterns.

Emotional Tone

  • The thread mixes awe (especially around ARC and FrontierMath leaps) with deep skepticism about AGI claims, benchmark gaming, and corporate hype.
  • A significant undercurrent of unease: people feel caught between excitement for the tech and fear about careers, inequality, and longer-term societal stability.