30% drop in o1-preview accuracy when Putnam problems are slightly varied

Benchmark contamination & “training on the test”

  • Many assume Putnam problems are in LLM training corpora, since the archive is public and models are trained on “whatever they can get.”
  • Some argue this is not “cheating” because Putnam is not an official benchmark used by labs, unlike held‑out sets such as MMLU, ARC‑AGI, or FrontierMath.
  • Others counter that once any problem set becomes a de facto yardstick in media or social media, vendors are incentivized to overfit to it, explicitly or via data contamination.
  • There’s disagreement over how rigorously big labs de‑duplicate or exclude benchmark data at web scale, and how much to trust their assurances.

Pattern-matching vs generalization

  • The 30% accuracy drop under small variations (renaming variables, changing constants, minor structural tweaks) is widely read as evidence of heavy pattern matching and memorization.
  • Some see this as “overfitting” or “teaching to the test,” not robust mathematical understanding.
  • Others emphasize that performance only partially degrades, not to zero, which suggests limited but real abstraction.
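To make the kind of perturbation concrete, here is a minimal sketch of variable renaming and constant substitution on a templated problem. The problem text, variable names, and constants are hypothetical illustrations, not drawn from the paper's actual dataset:

```python
import random

# A Putnam-style problem written as a template: the variable name and a
# constant are parameters, so each "variation" is just a different filling.
# (Hypothetical example problem, not from the paper's dataset.)
TEMPLATE = ("Let {var} be a positive integer such that {var}^2 + {const} "
            "is divisible by {var} + 1. Find all possible values of {var}.")

def make_variant(seed: int) -> str:
    """Produce one surface-level variation: rename the variable and swap
    the constant. The underlying solution method is unchanged, so a model
    that truly generalizes should score the same on every variant."""
    rng = random.Random(seed)
    var = rng.choice(["n", "m", "k", "t"])
    const = rng.choice([3, 5, 7, 11])
    return TEMPLATE.format(var=var, const=const)

original = TEMPLATE.format(var="n", const=3)
variant = make_variant(seed=42)
```

The point of such perturbations is that a memorized answer keyed to the original surface form fails, while a genuinely learned solution method carries over.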

Comparisons to other benchmarks and models

  • Multiple references to o3 getting ~25% on the held‑out FrontierMath benchmark; supporters present this as strong evidence of genuine reasoning on unseen problems.
  • Skeptics question contamination claims, methodology (e.g., simulated Codeforces runs, number of submissions, non‑live evaluations), and note independent attempts often find weaker performance on live contests.
  • Several point out the paper tested o1‑preview; newer o1/o1‑pro reportedly do better on the same variations, but this might reflect retraining on the released dataset.

Test‑time compute and “reasoning models”

  • Discussion of o‑series models using test‑time compute, chain‑of‑thought, and likely some form of search/tree‑of‑thought, as distinct from older “one‑shot” next‑token models.
  • Some argue this is a meaningful step toward reasoning; others say it is still pattern‑guided search in latent space, not true generalization.

Toy tests, tricks, and failure modes

  • Many concrete examples: river‑crossing puzzles, “which is heavier” questions, counting letters in sentences, riddles about family relationships, and buoyancy subtleties.
  • These often expose that models latch onto familiar puzzle templates and ignore small but decisive wording changes, or invent plausible‑sounding but wrong explanations.
  • A recurring theme: models can be coaxed into correct step‑by‑step reasoning with explicit prompts, but default, fast answers are brittle.
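Part of why the letter-counting toy test recurs in these threads is that its ground truth is trivially computable, so brittle model answers are easy to expose. A quick sketch (the sentence here is an arbitrary example):

```python
# Ground truth for a letter-counting toy test: count alphabetic characters,
# ignoring spaces and punctuation. Any sentence works; this one is arbitrary.
sentence = "How many letters are in this sentence?"
letter_count = sum(ch.isalpha() for ch in sentence)  # 31 for this sentence
```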

Broader views on intelligence and impact

  • One camp says LLMs are just very strong pattern recognizers or “stochastic parrots,” incapable of the kind of conceptual leaps exemplified by, say, pre‑1905 derivation of relativity.
  • Others insist the line between human “understanding” and large‑scale pattern learning is blurry, and note that many humans also rely on exam cramming and template matching.
  • There’s meta‑debate about “moving the goalposts” for what counts as intelligence once models pass former milestones (Turing‑test‑like behavior, exam performance).
  • Economic anxiety surfaces: huge investment vs modest real‑world returns, fear of an “AI bust,” and suspicion that hype and selective benchmarking are driven by financial pressure.