LIMO: Less Is More for Reasoning

Compressibility of Reasoning

  • Several comments speculate that there may be only a small set of generic “reasoning patterns,” but note that effective reasoning also requires domain‑specific steps (e.g., across math subfields) and strategies for recovering from impasses.
  • The LIMO result is interpreted as evidence that much of this structure is already present in large pretrained models; small, high‑quality datasets can elicit rather than create reasoning ability.

Self‑Play, “Chatbot‑Zero,” and Code Proofs

  • People compare AlphaGo Zero–style self‑play to LLMs and ask why we don’t have a “chatbot‑zero.”
  • Objections: conversations lack a clear win signal; pure self‑play would likely produce an idiosyncratic language unintelligible to humans.
  • DeepSeek‑R1‑Zero is discussed as a partial analogue: RL only, no CoT SFT, but still dependent on a heavily pretrained base model and labeled problems.
  • A detailed subthread imagines using RL/LLMs to generate formal code correctness proofs, with SMT solvers as the objective signal; concern about safely running arbitrary generated code is raised.
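The safety concern in that subthread — executing arbitrary model‑generated code to score it — is usually handled by isolating the code in a separate process with a hard timeout (real harnesses add OS‑level sandboxing on top). A minimal stdlib‑only sketch; `run_untrusted` is a hypothetical helper name, not something proposed in the thread:

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 2.0) -> tuple[bool, str]:
    """Run generated code in a fresh interpreter with a hard timeout.

    This only isolates the process and caps wall-clock time; a production
    harness would add seccomp/containers, memory limits, and no network.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env/site
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)
```

An infinite loop in the candidate program then simply scores as a failure instead of hanging the RL loop.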

Math Reasoning, Theorem Proving, and Latent Skills

  • One tension: claims that LLMs can’t generalize in theorem proving stand against this paper and related work arguing that models already contain rich mathematical knowledge that merely needs elicitation.
  • Some argue human knowledge is finite enough to be pattern‑matched; others note models still fail on niche expert domains.
  • A popular mental model: pretraining builds latent mathematical competence, but because the internet is mostly non‑reasoning text, models must be nudged (e.g., with a few curated CoT examples) to reliably use those circuits.

“Less Is More” Caveats and Data Curation

  • Strong criticism: the 817 math examples were distilled from a pool of ~10M candidate problems using powerful reasoning models (e.g., R1), and the base Qwen model was already trained on large curated math datasets.
  • Thus, “less data” is contingent on:
    • A huge, high‑quality pretrained model.
    • A very expensive selection pipeline driven by stronger models.
  • Many liken this to human textbooks: generations of effort distill millions of problems into a few hundred maximally instructive ones.
  • Some see this as scientifically important (elicitation threshold, pedagogy of LLMs); others say the title overstates “less is more” and want full performance‑vs‑data curves.
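The “very expensive selection pipeline” critics describe amounts to filtering a large pool down to the hardest, best‑explained problems. A toy sketch of that shape — the thresholds, field names, and the `solve_rate`/`quality` scores are illustrative assumptions, not the paper’s actual criteria:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    text: str
    solve_rate: float   # fraction of attempts a strong reference model solves
    quality: float      # rubric score for solution clarity, in [0, 1]

def select_limo_style(problems, k, max_solve_rate=0.3, min_quality=0.8):
    """Keep only hard (low solve-rate) problems with clean reference
    solutions, then return the k hardest. Illustrative thresholds only."""
    kept = [p for p in problems
            if p.solve_rate <= max_solve_rate and p.quality >= min_quality]
    kept.sort(key=lambda p: p.solve_rate)  # hardest first
    return kept[:k]
```

Note that scoring `solve_rate` requires many inference calls from a strong model per candidate, which is where the pipeline’s cost concentrates.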

Tools, Arithmetic, and True Reasoning

  • Debate on whether LLMs should be perfect calculators: most argue they should call external tools (calculators, Python, SMT, theorem provers) rather than emulate exact arithmetic internally.
  • This feeds into a broader skepticism: current systems often “mimic” reasoning and must be paired with verifiers, especially for safety‑critical tasks.
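The “call a tool instead of emulating arithmetic” position is straightforward to implement: the model emits an arithmetic expression and the host evaluates it exactly. A minimal stdlib sketch using a whitelisted `ast` walk (safer than `eval`); the function name and operator whitelist are illustrative:

```python
import ast
import operator

# Whitelisted operators: anything outside this set is rejected.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calc(expr: str) -> float:
    """Exactly evaluate an arithmetic expression emitted by the model,
    instead of trusting its token-by-token arithmetic."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"disallowed syntax: {ast.dump(node)}")
    return ev(ast.parse(expr, mode="eval").body)

calc("12345 * 6789")  # exact where in-context arithmetic often drifts
```

The same delegation pattern extends to the heavier verifiers mentioned in the thread (SMT solvers, theorem provers) acting as external checks on model output.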

Broader Implications

  • Comments imagine:
    • Iterative power transfer from huge to small models via distilled reasoning datasets.
    • Swarms of small, specialized submodels collaborating.
    • Using curated LIMO‑style sets as human teaching material.
  • There is also concern about using LLMs’ malleable “reasoning” for advertising and political persuasion, given how easily preferences can be embedded in prompts or training.