30% drop in o1-preview accuracy when Putnam problems are slightly varied
Benchmark contamination & “training on the test”
- Many assume Putnam problems are in LLM training corpora, since the archive is public and models are trained on “whatever they can get.”
- Some argue this is not “cheating” because Putnam is not an official benchmark that labs report against, unlike MMLU or the deliberately held‑out sets ARC‑AGI and FrontierMath.
- Others counter that once any problem set becomes a de facto yardstick in media or social media, vendors are incentivized to overfit to it, explicitly or via data contamination.
- There’s disagreement over how rigorously big labs de‑duplicate or exclude benchmark data at web scale, and how much to trust their assurances.
Pattern-matching vs generalization
- The 30% accuracy drop under small variations (renaming variables, changing constants, minor structural tweaks; a toy sketch of such a variation follows this list) is widely read as evidence of heavy pattern matching and memorization.
- Some see this as “overfitting” or “teaching to the test,” not robust mathematical understanding.
- Others emphasize that performance degrades only partially rather than collapsing to zero, which suggests limited but real abstraction.
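The paper’s actual variation pipeline isn’t described in the thread; as a rough illustration of what “renaming variables and changing constants” could mean in practice, here is a minimal Python sketch. The problem template, placeholder names, and constant ranges are all hypothetical, invented for illustration only.

```python
import random
import string

def make_variation(template: str, constants: dict[str, range]) -> str:
    """Produce a surface-level variation of a templated problem statement by
    renaming its single free variable and resampling its numeric constants.

    Toy stand-in for a variation pipeline; `template` and its placeholders
    ({var}, {k}) are hypothetical, not taken from the paper.
    """
    # Pick a fresh single-letter variable name, avoiding easily confused letters.
    new_var = random.choice([c for c in string.ascii_lowercase if c not in "lo"])
    # Resample each named constant from its allowed range.
    values = {name: random.choice(list(rng)) for name, rng in constants.items()}
    return template.format(var=new_var, **values)

# Hypothetical Putnam-style template; an "original" might fix var="n", k=2010.
template = ("Find the smallest positive integer {var} such that "
            "{var}^2 + {k}{var} is a perfect square.")
print(make_variation(template, {"k": range(2001, 2030)}))
```

The point of such variations is that the underlying solution method is unchanged while the surface form no longer matches anything memorized; a model that truly generalizes should score roughly the same on both versions.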
Comparisons to other benchmarks and models
- Multiple references to o3 getting ~25% on the held‑out FrontierMath benchmark; supporters present this as strong evidence of genuine reasoning on unseen problems.
- Skeptics question contamination claims and methodology (e.g., simulated Codeforces runs, number of submissions, non‑live evaluations), and note that independent attempts often find weaker performance on live contests.
- Several point out that the paper tested o1‑preview; newer o1/o1‑pro reportedly do better on the same variations, though this might reflect retraining on the released dataset.
Test‑time compute and “reasoning models”
- Discussion of o‑series models using test‑time compute, chain‑of‑thought, and likely some form of search/tree‑of‑thought, as distinct from older “one‑shot” next‑token models (a minimal sketch of one such test‑time strategy follows this list).
- Some argue this is a meaningful step toward reasoning; others say it is still pattern‑guided search in latent space, not true generalization.
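The internals of the o‑series are not public; as a concrete illustration of one simple form of “spending more compute at test time,” here is a self‑consistency sketch. `sample_answer` is a hypothetical stand‑in for a single stochastic model call, and nothing in the snippet describes how o1/o3 actually work.

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(prompt: str,
                     sample_answer: Callable[[str], str],
                     n_samples: int = 16) -> str:
    """Spend extra test-time compute by drawing several independent
    chain-of-thought samples and majority-voting the final answer.

    `sample_answer` is a hypothetical stand-in for one stochastic model call
    that returns only the extracted final answer.
    """
    votes = Counter(sample_answer(prompt) for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer

def dummy(_prompt: str) -> str:
    # Toy "model": correct about 70% of the time, otherwise a random digit.
    return "42" if random.random() < 0.7 else str(random.randint(0, 9))

print(self_consistency("What is 6 * 7?", dummy))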
Toy tests, tricks, and failure modes
- Many concrete examples: river‑crossing puzzles, “which is heavier” questions, counting letters in sentences, riddles about family relationships, and buoyancy subtleties.
- These often expose that models latch onto familiar puzzle templates and ignore small but decisive wording changes, or invent plausible‑sounding but wrong explanations.
- A recurring theme: models can be coaxed into correct step‑by‑step reasoning with explicit prompts, but default, fast answers are brittle.
Broader views on intelligence and impact
- One camp says LLMs are just very strong pattern recognizers or “stochastic parrots,” incapable of the kind of conceptual leap exemplified by, say, deriving special relativity from pre‑1905 physics.
- Others insist the line between human “understanding” and large‑scale pattern learning is blurry, and note that many humans also rely on exam cramming and template matching.
- There’s meta‑debate about “moving the goalposts” for what counts as intelligence once models pass former milestones (Turing‑test‑like behavior, exam performance).
- Economic anxiety surfaces: huge investment vs modest real‑world returns, fear of an “AI bust,” and suspicion that hype and selective benchmarking are driven by financial pressure.