30% drop in o1-preview accuracy when Putnam problems are slightly varied

Benchmark contamination & “training on the test”

  • Many assume Putnam problems are in LLM training corpora, since the archive is public and models are trained on “whatever they can get.”
  • Some argue this is not “cheating” because Putnam is not an official benchmark used by labs, unlike held‑out sets such as MMLU, ARC‑AGI, or FrontierMath.
  • Others counter that once any problem set becomes a de facto yardstick in media or social media, vendors are incentivized to overfit to it, explicitly or via data contamination.
  • There’s disagreement over how rigorously big labs de‑duplicate or exclude benchmark data at web scale, and how much to trust their assurances.

Pattern-matching vs generalization

  • The 30% accuracy drop under small variations (renaming variables, changing constants, minor structural tweaks) is widely read as evidence of heavy pattern matching and memorization.
  • Some see this as “overfitting” or “teaching to the test,” not robust mathematical understanding.
  • Others emphasize that performance only partially degrades, not to zero, which suggests limited but real abstraction.
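To make the kind of perturbation concrete, here is a minimal sketch of variable renaming and constant substitution on a templated problem. The problem text, variable names, and constants are hypothetical illustrations, not drawn from the paper's actual dataset:

```python
import random

# A Putnam-style problem written as a template: the variable name and a
# constant are parameters, so each "variation" is just a different filling.
# (Hypothetical example problem, not from the paper's dataset.)
TEMPLATE = ("Let {var} be a positive integer such that {var}^2 + {const} "
            "is divisible by {var} + 1. Find all possible values of {var}.")

def make_variant(seed: int) -> str:
    """Produce one surface-level variation: rename the variable and swap
    the constant. The underlying solution method is unchanged, so a model
    that truly generalizes should score the same on every variant."""
    rng = random.Random(seed)
    var = rng.choice(["n", "m", "k", "t"])
    const = rng.choice([3, 5, 7, 11])
    return TEMPLATE.format(var=var, const=const)

original = TEMPLATE.format(var="n", const=3)
variant = make_variant(seed=42)
```

The point of such perturbations is that a memorized answer keyed to the original surface form fails, while a genuinely learned solution method carries over.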

Comparisons to other benchmarks and models

  • Multiple references to o3 getting ~25% on the held‑out FrontierMath benchmark; supporters present this as strong evidence of genuine reasoning on unseen problems.
  • Skeptics question contamination claims, methodology (e.g., simulated Codeforces runs, number of submissions, non‑live evaluations), and note independent attempts often find weaker performance on live contests.
  • Several point out the paper tested o1‑preview; newer o1/o1‑pro reportedly do better on the same variations, but this might reflect retraining on the released dataset.

Test‑time compute and “reasoning models”

  • Discussion of o‑series models using test‑time compute, chain‑of‑thought, and likely some form of search/tree‑of‑thought, as distinct from older “one‑shot” next‑token models.
  • Some argue this is a meaningful step toward reasoning; others say it is still pattern‑guided search in latent space, not true generalization.

Toy tests, tricks, and failure modes

  • Many concrete examples: river‑crossing puzzles, “which is heavier” questions, counting letters in sentences, riddles about family relationships, and buoyancy subtleties.
  • These often expose that models latch onto familiar puzzle templates and ignore small but decisive wording changes, or invent plausible‑sounding but wrong explanations.
  • A recurring theme: models can be coaxed into correct step‑by‑step reasoning with explicit prompts, but default, fast answers are brittle.
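Part of why the letter-counting toy test recurs in these threads is that its ground truth is trivially computable, so brittle model answers are easy to expose. A quick sketch (the sentence here is an arbitrary example):

```python
# Ground truth for a letter-counting toy test: count alphabetic characters,
# ignoring spaces and punctuation. Any sentence works; this one is arbitrary.
sentence = "How many letters are in this sentence?"
letter_count = sum(ch.isalpha() for ch in sentence)  # 31 for this sentence
```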

Broader views on intelligence and impact

  • One camp says LLMs are just very strong pattern recognizers or “stochastic parrots,” incapable of the kind of conceptual leaps exemplified by, say, pre‑1905 derivation of relativity.
  • Others insist the line between human “understanding” and large‑scale pattern learning is blurry, and note that many humans also rely on exam cramming and template matching.
  • There’s meta‑debate about “moving the goalposts” for what counts as intelligence once models pass former milestones (Turing‑test‑like behavior, exam performance).
  • Economic anxiety surfaces: huge investment vs modest real‑world returns, fear of an “AI bust,” and suspicion that hype and selective benchmarking are driven by financial pressure.