ARC-AGI without pretraining
Pretraining, knowledge, and “generality”
- One camp argues that extensive pretraining undermines the spirit of a truly general system: a “pure” ARC solver that infers a rule from a few examples feels closer to AGI than a giant model trained on most of the test distribution.
- Others counter that “general intelligence is useless without vast knowledge”; pretraining supplies knowledge, while the algorithms embody intelligence.
- Several people note that the line between pretraining and in‑context learning is blurry, or merely an implementation detail: a long-context model fed a “bootstrap sequence” is effectively being pretrained at inference time.
Innate structure vs learned experience (human analogy)
- Multiple comments stress that human brains are heavily “pretrained” by evolution (genomic bottleneck, specialized brain regions, instincts) plus years of sensory input.
- Newborns and toddlers already possess concepts like object persistence and basic “folk physics”; this is likened to baked-in priors, not blank-slate learning.
- Others emphasize that humans still generalize far beyond what is genetically encoded, so we shouldn’t demand zero prior structure in machines, only reusability of that structure across tasks.
Compression and intelligence
- Many participants accept a deep connection between intelligence and compression (Kolmogorov complexity, MDL, Hutter prize): good intelligence finds short models that predict complex data.
- Others are skeptical: achieving ~34.75% train / 20% eval on ARC doesn’t prove “intelligence = compression”; compression can be improved without capturing high‑level concepts, and “maximal compression” claims are challenged as hand‑wavy.
- Chollet’s opposing view (intelligence ≠ mere compression) is referenced, and some note that compression ideas also underpin VAEs and standard regularization, so the philosophy is not new.
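The “intelligence as compression” idea above can be made concrete with a toy two-part MDL (minimum description length) calculation. This is an illustration assumed for this summary, not code from the discussion: the bigram model and the flat 16-bit model cost are arbitrary choices. The point is that data with exploitable structure gets a shorter total description once a model captures that structure.

```python
# Toy two-part MDL sketch (illustrative assumption, not from the thread):
# total description length = L(model) + L(data | model).
import math
from collections import Counter

def bigram_bits(data: str) -> float:
    """Bits to encode `data` under its own empirical bigram (order-1) model."""
    pairs = Counter(zip(data, data[1:]))
    contexts = Counter(data[:-1])
    return -sum(c * math.log2(c / contexts[a]) for (a, _), c in pairs.items())

def two_part_mdl(data: str, model_bits: float) -> float:
    """Two-part code length: bits to describe the model plus bits for the data given it."""
    return model_bits + bigram_bits(data)

structured = "abab" * 16              # perfectly regular: the bigram model predicts every symbol
irregular = "abbaabababbbaaab" * 4    # same length and symbol counts, but less regular

# With the same (assumed) 16-bit model cost, the regular string has the
# shorter total description: "finding the pattern" is what compresses it.
assert two_part_mdl(structured, 16.0) < two_part_mdl(irregular, 16.0)
```

Note that a simpler order-0 (symbol-frequency) code would assign both strings the same length, since their symbol counts match; only a model that captures the sequential structure separates them, which is the skeptics’ point that better compression requires capturing the right regularities.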
How CompressARC works (high-level)
- The method trains a separate small neural network per puzzle, using only that puzzle’s examples.
- The objective is to fit the examples exactly while minimizing a description-length‑style measure: the bits needed to encode the latent z, the weights θ, and the corrections from predicted to true pixels (computed via KL divergences and weight penalties).
- This is described as Bayesian deep learning / VAE‑like, with heavy architectural engineering and strong equivariances; because z is the network’s only way to break symmetries, the setup avoids trivial memorization and latent collapse.
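The description-length objective sketched above can be written out in a simplified form. This is not the authors’ implementation; it is a minimal NumPy sketch assuming a standard-normal prior on z, a Gaussian prior on θ, and a categorical (softmax) distribution over pixel colors, with all terms measured in bits:

```python
# Minimal sketch (assumed form, not CompressARC's actual code) of a
# description-length objective: bits for the latent z + bits for the
# weights theta + bits to correct predicted pixels to the true grid.
import numpy as np

def gaussian_kl_bits(mu, log_sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) summed over dimensions, in bits."""
    kl_nats = 0.5 * np.sum(np.exp(2 * log_sigma) + mu**2 - 1 - 2 * log_sigma)
    return kl_nats / np.log(2)

def weight_bits(theta, sigma_w=1.0):
    """Bits to encode weights under an assumed N(0, sigma_w^2) prior (up to a constant)."""
    return 0.5 * np.sum((theta / sigma_w) ** 2) / np.log(2)

def correction_bits(logits, true_pixels):
    """Bits to correct the predicted per-pixel color distribution to the true grid."""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    flat = p.reshape(-1, p.shape[-1])
    probs = flat[np.arange(true_pixels.size), true_pixels.ravel()]
    return -np.sum(np.log2(probs))

def description_length(mu, log_sigma, theta, logits, true_pixels):
    """Total bits: latent code + weight code + pixel corrections."""
    return (gaussian_kl_bits(mu, log_sigma)
            + weight_bits(theta)
            + correction_bits(logits, true_pixels))

# Usage on random placeholder shapes (a 3x3 grid with 10 colors):
rng = np.random.default_rng(0)
grid = rng.integers(0, 10, size=(3, 3))
total = description_length(mu=rng.normal(size=8), log_sigma=np.zeros(8),
                           theta=rng.normal(size=32),
                           logits=rng.normal(size=(3, 3, 10)),
                           true_pixels=grid)
```

Minimizing this total per puzzle is what “exactly fit while compressing” means in practice: confident, correct pixel predictions shrink the correction term, while the KL and weight terms penalize any information smuggled into z or θ.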
Strengths, limitations, and scope
- Enthusiasts find it remarkable that a non‑pretrained model, given only a few examples, can solve ~20% of unseen ARC puzzles in ~20 minutes per puzzle.
- Critics argue ARC puzzles are extremely “clean” and low-noise; with more arbitrary structure, mere compression might fail or just memorize.
- There’s concern that the architecture is overly tailored to ARC and may not transfer to other domains or noisy data. Some doubt the method meets ARC’s 12‑hour, 100‑puzzle competition constraint.
Generalization, AGI, and benchmarks
- Several comments broaden to AGI: is AlphaZero‑style general game-playing already “narrow AGI,” or must a single architecture learn any human task without task‑specific redesign?
- Some note that humans themselves are specialized and education‑heavy, so demanding a single instance that does “everything everywhere at once” may be unrealistic.
- Others insist that human‑level generality (same architecture, new tasks via learning, not redesign) remains the relevant target; otherwise, we’re just building many specialized systems, not AGI.