Getting 50% (SoTA) on ARC-AGI with GPT-4o

What the result actually is

  • The method gets ~50% on the ARC-AGI public evaluation set by having GPT‑4o generate ~8k Python programs per task and selecting those that pass the task's examples.
  • The private test set (the prize benchmark) is a different set; the current SOTA on it is ~34–35%, and that same entry also scores around 50% on the public set.
  • Whether this is a true SOTA remains unclear until the result is independently reproduced and run on the private set.
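
As a rough illustration of the generate-and-select step summarized above, here is a minimal Python sketch. It is not the author's code: the candidate programs are hard-coded strings standing in for model samples, and the names `transform`, `compile_candidate`, `passes_all`, and `select_programs` are hypothetical. It assumes each candidate defines a function `transform(grid)` and keeps only those that reproduce every example output exactly.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)


def compile_candidate(source: str) -> Optional[Callable[[Grid], Grid]]:
    """Turn candidate source into a callable, or None if it does not load.

    Assumption: every candidate defines `transform(grid)`. exec() runs
    untrusted code with no sandbox, so this is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception:  # includes SyntaxError
        return None
    fn = namespace.get("transform")
    return fn if callable(fn) else None


def passes_all(fn: Callable[[Grid], Grid], examples: List[Example]) -> bool:
    """True if fn maps every example input to its expected output."""
    for grid_in, grid_out in examples:
        try:
            # Pass a copy so a candidate cannot mutate the example data.
            if fn([row[:] for row in grid_in]) != grid_out:
                return False
        except Exception:
            return False
    return True


def select_programs(candidates: List[str], examples: List[Example]) -> List[str]:
    """Keep the candidate sources that load and pass every example."""
    survivors = []
    for source in candidates:
        fn = compile_candidate(source)
        if fn is not None and passes_all(fn, examples):
            survivors.append(source)
    return survivors


# Hard-coded stand-ins for sampled programs (hypothetical, not model output).
CANDIDATES = [
    "def transform(grid):\n    return grid",
    "def transform(grid):\n    return [row[::-1] for row in grid]",
    "def transform(grid):\n    return grid[::-1]",
    "def transform(grid)\n    return grid",  # deliberately malformed
]

# Toy task: mirror each row left-to-right.
EXAMPLES = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 6, 7]], [[7, 6, 5]]),
]

SURVIVORS = select_programs(CANDIDATES, EXAMPLES)
```

In this toy run only the row-mirroring candidate survives. A real pipeline would also need sandboxing, timeouts, and a rule for choosing among several survivors, none of which is shown here.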

Brute force, search, and program synthesis

  • Many see the approach as “generate-and-test” program synthesis with a large outer search loop, not “pure reasoning” by the LLM.
  • Debate over whether this counts as brute force: critics say it’s dumb search over many candidates; defenders say that succeeding with only 8k samples from a huge program space shows the search is heavily guided by the LLM, making it more heuristic than brute force.
  • Several suggest combining this with better search (MCTS, AlphaZero-like methods, genetic programming, specialized DSLs) and/or fine‑tuned models for further gains.
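
The "guided versus brute" argument above can be made concrete with a small back-of-the-envelope calculation. This sketch assumes samples are independent and that each one solves the task with the same probability `p_single`; both are simplifying assumptions, and the two per-sample rates below are made-up illustrative numbers, not measurements from the post or the thread. Under those assumptions, the chance that at least one of `k` samples succeeds is 1 - (1 - p)^k.

```python
import math


def p_at_least_one(p_single: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds.

    Computes 1 - (1 - p)^k via log1p/expm1 so tiny p values stay accurate.
    """
    return -math.expm1(k * math.log1p(-p_single))


K = 8000  # roughly the per-task sample count mentioned above

# Illustrative assumption: a guided sampler is right about once per thousand tries.
GUIDED = p_at_least_one(1e-3, K)

# Illustrative assumption: blind enumeration is right about once per billion tries.
BLIND = p_at_least_one(1e-9, K)

print(f"guided: {GUIDED:.4f}, blind: {BLIND:.2e}")
```

With these hypothetical rates, 8k guided samples almost always include a hit while 8k blind samples almost never do, which is the intuition behind calling the approach heuristic rather than brute force.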

ARC as an AGI benchmark

  • Some argue ARC was designed to stress generalization, compositionality, and human-like “core knowledge,” so success by LLM+outer-loop actually supports the benchmark’s value.
  • Others think ARC is flawed: a small dataset, a distribution mismatch between the training and evaluation sets, and vulnerability to Goodharting / benchmark gaming.
  • There’s disagreement over whether solving ARC would mean anything close to AGI, or just “one narrow benchmark beaten by scale and clever wrappers.”

Training data contamination & fairness

  • Concerns that GPT‑4o likely saw the public ARC tasks and related discussions in training, which may help indirectly.
  • Counterpoint: merely seeing tasks once or a few times doesn’t trivially allow regurgitation, and the main difficulty is writing correct programs and running the search, not memorizing answers.

LLMs, intelligence, and “AGI”

  • Split views:
    • Some claim current LLMs already exhibit a weak but genuine form of general intelligence.
    • Others stress missing properties: robust world models, in-context learning comparable to humans, reliable reasoning, autonomy, and efficient learning from few examples.
  • Broad agreement that hybrid approaches (LLM + program search / tools / outer loops) are promising, but not yet “human-like AGI.”