LIMO: Less Is More for Reasoning

Compressibility of Reasoning

  • Several comments speculate that there may be only a small set of generic “reasoning patterns,” but note that effective reasoning also requires domain‑specific steps (e.g., across math subfields) and strategies for recovering from impasses.
  • The LIMO result is interpreted as evidence that much of this structure is already present in large pretrained models; small, high‑quality datasets can elicit rather than create reasoning ability.

Self‑Play, “Chatbot‑Zero,” and Code Proofs

  • People compare AlphaGo Zero–style self‑play to LLMs and ask why we don’t have a “chatbot‑zero.”
  • Objections: conversations lack a clear win signal; pure self‑play would likely produce an idiosyncratic language unintelligible to humans.
  • DeepSeek‑R1‑Zero is discussed as a partial analogue: RL only, no CoT SFT, but still dependent on a heavily pretrained base model and labeled problems.
  • A detailed subthread imagines using RL/LLMs to generate formal code correctness proofs, with SMT solvers as the objective signal; concern about safely running arbitrary generated code is raised.
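The safety concern in that subthread — executing arbitrary model‑generated code to score it — is usually handled by isolating the code in a separate process with a hard timeout (real harnesses add OS‑level sandboxing on top). A minimal stdlib‑only sketch; `run_untrusted` is a hypothetical helper name, not something proposed in the thread:

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 2.0) -> tuple[bool, str]:
    """Run generated code in a fresh interpreter with a hard timeout.

    This only isolates the process and caps wall-clock time; a production
    harness would add seccomp/containers, memory limits, and no network.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env/site
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)
```

An infinite loop in the candidate program then simply scores as a failure instead of hanging the RL loop.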

Math Reasoning, Theorem Proving, and Latent Skills

  • One tension: claims that LLMs can’t generalize in theorem proving stand against this paper and related work arguing that models already contain rich mathematical knowledge that merely needs elicitation.
  • Some argue human knowledge is finite enough to be pattern‑matched; others note models still fail on niche expert domains.
  • A popular mental model: pretraining builds latent mathematical competence, but because the internet is mostly non‑reasoning text, models must be nudged (e.g., with a few curated CoT examples) to reliably use those circuits.

“Less Is More” Caveats and Data Curation

  • Strong criticism: the 817 math examples were distilled from a pool of ~10M candidate problems using powerful reasoning models (e.g., R1), and the base Qwen model was already trained on large curated math datasets.
  • Thus, “less data” is contingent on:
    • A huge, high‑quality pretrained model.
    • A very expensive selection pipeline driven by stronger models.
  • Many liken this to human textbooks: generations of effort distill millions of problems into a few hundred maximally instructive ones.
  • Some see this as scientifically important (elicitation threshold, pedagogy of LLMs); others say the title overstates “less is more” and want full performance‑vs‑data curves.
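The “very expensive selection pipeline” critics describe amounts to filtering a large pool down to the hardest, best‑explained problems. A toy sketch of that shape — the thresholds, field names, and the `solve_rate`/`quality` scores are illustrative assumptions, not the paper’s actual criteria:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    text: str
    solve_rate: float   # fraction of attempts a strong reference model solves
    quality: float      # rubric score for solution clarity, in [0, 1]

def select_limo_style(problems, k, max_solve_rate=0.3, min_quality=0.8):
    """Keep only hard (low solve-rate) problems with clean reference
    solutions, then return the k hardest. Illustrative thresholds only."""
    kept = [p for p in problems
            if p.solve_rate <= max_solve_rate and p.quality >= min_quality]
    kept.sort(key=lambda p: p.solve_rate)  # hardest first
    return kept[:k]
```

Note that scoring `solve_rate` requires many inference calls from a strong model per candidate, which is where the pipeline’s cost concentrates.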

Tools, Arithmetic, and True Reasoning

  • Debate on whether LLMs should be perfect calculators: most argue they should call external tools (calculators, Python, SMT, theorem provers) rather than emulate exact arithmetic internally.
  • This feeds into a broader skepticism: current systems often “mimic” reasoning and must be paired with verifiers, especially for safety‑critical tasks.
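The “call a tool instead of emulating arithmetic” position is straightforward to implement: the model emits an arithmetic expression and the host evaluates it exactly. A minimal stdlib sketch using a whitelisted `ast` walk (safer than `eval`); the function name and operator whitelist are illustrative:

```python
import ast
import operator

# Whitelisted operators: anything outside this set is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calc(expr: str) -> float:
    """Exactly evaluate an arithmetic expression emitted by the model,
    instead of trusting its token-by-token arithmetic."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"disallowed syntax: {ast.dump(node)}")
    return ev(ast.parse(expr, mode="eval").body)

calc("12345 * 6789")  # exact where in-context arithmetic often drifts
```

The same delegation pattern extends to the heavier verifiers mentioned in the thread (SMT solvers, theorem provers) acting as external checks on model output.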

Broader Implications

  • Comments imagine:
    • Iterative power transfer from huge to small models via distilled reasoning datasets.
    • Swarms of small, specialized submodels collaborating.
    • Using curated LIMO‑style sets as human teaching material.
  • There is also concern about using LLMs’ malleable “reasoning” for advertising and political persuasion, given how easily preferences can be embedded in prompts or training.