LIMO: Less Is More for Reasoning
Compressibility of Reasoning
- Several comments speculate that there may be only a small set of generic "reasoning patterns," but note that effective reasoning also requires domain‑specific steps (e.g., across math subfields) and strategies for recovering from impasses.
- The LIMO result is interpreted as evidence that much of this structure is already present in large pretrained models; small, high‑quality datasets can elicit rather than create reasoning ability.
Self‑Play, “Chatbot‑Zero,” and Code Proofs
- People compare AlphaGo Zero–style self‑play to LLMs and ask why we don’t have a “chatbot‑zero.”
- Objections: conversations lack a clear win signal; pure self‑play would likely produce an idiosyncratic language unintelligible to humans.
- DeepSeek‑R1‑Zero is discussed as a partial analogue: RL only, no CoT SFT, but still dependent on a heavily pretrained base model and labeled problems.
- A detailed subthread imagines using RL/LLMs to generate formal code correctness proofs, with SMT solvers as the objective signal; concern about safely running arbitrary generated code is raised.
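The verifier-as-reward idea in that subthread can be sketched concretely. The snippet below is a hypothetical illustration, not anything from the thread: a `reward` function (name invented here) that executes a model-generated candidate definition and scores it against a correctness check. A brute-force property check stands in for the SMT-solver verdict the commenters propose, and the stripped-down namespace only gestures at sandboxing; real systems would need process isolation, timeouts, and resource limits, which is exactly the safety concern raised.

```python
def reward(candidate_src: str) -> float:
    """Binary RL reward: 1.0 iff the generated candidate verifies.

    Hypothetical sketch: exhaustive checking on a finite domain stands in
    for a symbolic SMT proof, and the empty-builtins namespace is NOT a
    real sandbox (proper isolation would use separate processes with
    timeouts and resource limits).
    """
    ns = {"__builtins__": {}}          # deny builtins to the generated code
    try:
        exec(candidate_src, ns)        # run the candidate definition
        f = ns["abs_val"]              # hypothetical target function name
        # "Verifier": |x| must be non-negative and equal to x or -x,
        # checked exhaustively in place of a symbolic proof.
        ok = all(f(x) >= 0 and f(x) in (x, -x) for x in range(-100, 101))
        return 1.0 if ok else 0.0
    except Exception:                  # crash or missing symbol: no reward
        return 0.0
```

The objective signal here is exact (pass/fail), which is what makes the AlphaGo Zero analogy attractive: unlike open-ended conversation, formal correctness gives RL an unambiguous win condition.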
Math Reasoning, Theorem Proving, and Latent Skills
- One tension: claims that LLMs can't generalize in theorem proving versus this paper (and related work) arguing that models already contain rich mathematical knowledge that needs only elicitation.
- Some argue human knowledge is finite enough to be pattern‑matched; others note models still fail on niche expert domains.
- A popular mental model: pretraining builds latent mathematical competence, but because the internet is mostly non‑reasoning text, models must be nudged (e.g., with a few curated CoT examples) to reliably use those circuits.
“Less Is More” Caveats and Data Curation
- Strong criticism: the 817 math examples were distilled from ~10M using powerful reasoning models (e.g., R1), and the base Qwen model was already trained on large curated math datasets.
- Thus, “less data” is contingent on:
- A huge, high‑quality pretrained model.
- A very expensive selection pipeline driven by stronger models.
- Many liken this to human textbooks: generations of effort distill millions of problems into a few hundred maximally instructive ones.
- Some see this as scientifically important (elicitation threshold, pedagogy of LLMs); others say the title overstates “less is more” and want full performance‑vs‑data curves.
Tools, Arithmetic, and True Reasoning
- Debate on whether LLMs should be perfect calculators: most argue they should call external tools (calculators, Python, SMT, theorem provers) rather than emulate exact arithmetic internally.
- This feeds into a broader skepticism: current systems often “mimic” reasoning and must be paired with verifiers, especially for safety‑critical tasks.
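The tool-call position can be made concrete with a minimal sketch (names such as `calc` are invented here, and this is one possible design, not a specific system from the thread): rather than the model "predicting" digits, it emits an arithmetic expression that a trusted evaluator computes exactly, using `ast` to reject anything beyond arithmetic and `Fraction` to avoid floating-point error.

```python
import ast
import operator
from fractions import Fraction

# Allowed operations for the arithmetic "tool" the model would call.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def calc(expr: str) -> Fraction:
    """Exactly evaluate +, -, *, /, ** over rationals; reject anything else."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("disallowed expression")
    return ev(ast.parse(expr, mode="eval"))
```

A call like `calc("1/3 + 1/6")` returns the exact `Fraction(1, 2)` rather than `0.4999...`, illustrating why delegating arithmetic to a verifier-grade tool beats emulating it in the model's weights.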
Broader Implications
- Comments imagine:
- Iterative power transfer from huge to small models via distilled reasoning datasets.
- Swarms of small, specialized submodels collaborating.
- Using curated LIMO‑style sets as human teaching material.
- There is also concern about using LLMs’ malleable “reasoning” for advertising and political persuasion, given how easily preferences can be embedded in prompts or training.