2024-06-17

Getting 50% (SoTA) on Arc-AGI with GPT-4o

What the result actually is

The method reaches ~50% on the ARC-AGI public evaluation set by having GPT‑4o generate ~8k Python programs per task and keeping those that reproduce the task's example outputs.
The private test set (the prize benchmark) is a different set of tasks; the current private-set SOTA is ~34–35%, and that approach reportedly also scores around 50% on the public set.
It’s unclear whether this is true SOTA until independently reproduced and run on the private set.
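
A minimal sketch of the generate-and-test selection step described above, not the author's actual pipeline. It assumes each candidate is Python source defining a function named `transform` (a hypothetical convention), and it breaks ties among passing candidates by majority vote, which is an assumption added here for illustration. The LLM sampling step is replaced by a hard-coded candidate list.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)


def run_candidate(source: str, grid: Grid) -> Optional[Grid]:
    """Run candidate source that should define transform(grid); None on any error."""
    namespace: dict = {}
    try:
        # exec on untrusted code is unsafe and has no timeout here; illustration only.
        exec(source, namespace)
        return namespace["transform"]([row[:] for row in grid])
    except Exception:
        return None


def select_output(candidates: List[str], examples: List[Example],
                  test_input: Grid) -> Optional[Grid]:
    """Keep candidates that reproduce every example, then vote on their test outputs."""
    votes: Counter = Counter()
    outputs: Dict[str, Grid] = {}
    for source in candidates:
        if all(run_candidate(source, x) == y for x, y in examples):
            out = run_candidate(source, test_input)
            if out is not None:
                key = repr(out)  # hashable stand-in for the grid
                votes[key] += 1
                outputs[key] = out
    if not votes:
        return None  # no candidate passed all examples
    return outputs[votes.most_common(1)[0][0]]


# Hypothetical candidates standing in for LLM samples.
candidates = [
    "def transform(grid):\n    return [row[::-1] for row in grid]",  # mirror each row
    "def transform(grid):\n    return grid[::-1]",                   # flip vertically
    "def transform(grid):\n    return grid[0]",                      # wrong output shape
    "def transform(grid) return grid",                               # syntax error
]
examples = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]

print(select_output(candidates, examples, [[3, 4, 5]]))
```

Only the first candidate reproduces the example, so its output on the test grid is the one returned; with no passing candidate the function returns `None`.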

Brute force, search, and program synthesis

Many see the approach as “generate-and-test” program synthesis with a large outer search loop, not “pure reasoning” by the LLM.
Debate over whether this counts as brute force: critics call it dumb search over many candidates; defenders reply that 8k samples drawn from a huge program space must be heavily guided by the LLM, which makes it more heuristic than brute.
Several suggest combining this with better search (MCTS, AlphaZero-like methods, genetic programming, specialized DSLs) and/or fine‑tuned models for further gains.
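
The defenders' point in the brute-force debate can be made concrete with simple arithmetic. The sketch below assumes samples are independent and each has the same per-sample probability of passing the examples (a simplification; real samples are correlated), and computes the per-sample hit rate needed for at least one hit among 8k samples. The function names are illustrative.

```python
def p_at_least_one(p: float, k: int) -> float:
    """Chance that at least one of k independent samples passes, each with probability p."""
    return 1.0 - (1.0 - p) ** k


def required_hit_rate(target: float, k: int) -> float:
    """Per-sample pass probability needed to reach `target` chance of at least one hit."""
    return 1.0 - (1.0 - target) ** (1.0 / k)


K = 8000  # samples per task, from the discussion above
for target in (0.1, 0.5, 0.9):
    p = required_hit_rate(target, K)
    print(f"target {target:.0%}: per-sample hit rate {p:.2e} (about 1 in {1 / p:,.0f})")
```

Under these assumptions, a 50% chance of at least one passing program needs a per-sample hit rate of roughly 1 in 11,500. An unguided enumeration of a very large program space would be expected to hit far less often per sample, which is the sense in which the sampling is called heuristic rather than brute.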

ARC as an AGI benchmark

Some argue ARC was designed to stress generalization, compositionality, and human-like “core knowledge,” so success by LLM+outer-loop actually supports the benchmark’s value.
Others think ARC is flawed: small data, a distribution mismatch between the training and eval sets, and vulnerability to Goodharting / benchmark gaming.
There’s disagreement over whether solving ARC would mean anything close to AGI, or just “one narrow benchmark beaten by scale and clever wrappers.”

Training data contamination & fairness

Concerns that GPT‑4o likely saw the public ARC tasks and related discussions in training, which may help indirectly.
Counterpoint: merely seeing tasks once or a few times doesn’t trivially allow regurgitation, and the main difficulty is writing correct programs and search, not memorizing answers.

LLMs, intelligence, and “AGI”

Split views:
- Some claim current LLMs already exhibit a weak but genuine form of general intelligence.
- Others stress missing properties: robust world models, in-context learning comparable to humans, reliable reasoning, autonomy, and efficient learning from few examples.
Broad agreement that hybrid approaches (LLM + program search / tools / outer loops) are promising, but not yet “human-like AGI.”

Related topics