Embarrassingly simple self-distillation improves code generation

Overview of the SSD Idea

  • Paper proposes “simple self-distillation” (SSD) for code models:
    – Sample the base model at a fixed temperature with truncation sampling (top‑k / top‑p).
    – Fine‑tune the same model on its own raw outputs using standard cross‑entropy.
  • No correctness checking, execution, or reward signal is used; even wrong or incoherent samples are kept.
  • Reported gains on hard coding benchmarks are large, especially for mid‑sized models.
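Taken literally, the two-step recipe can be mimicked on a one-position toy "model" (a single categorical distribution). The sketch below is an illustration, not the paper's code; the temperature and top-p values are assumptions. Refitting by maximum likelihood on the samples is what cross-entropy training on raw outputs converges to, so tokens outside the sampling nucleus end up with zero probability "baked in":

```python
import math
import random

def sample_truncated(logits, temperature=0.8, top_p=0.95, rng=random):
    """Temperature + nucleus (top-p) sampling over a toy vocabulary.
    `logits` is one float per token id; the hyperparameters here are
    illustrative assumptions, not the paper's settings."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Keep the smallest high-probability set whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

def ssd_round(logits, n_samples=20_000, rng=random):
    """One SSD round on the toy model: sample raw outputs (no filtering,
    no correctness check), then refit by maximum likelihood -- the
    cross-entropy optimum on the samples is their empirical distribution."""
    counts = [0] * len(logits)
    for _ in range(n_samples):
        counts[sample_truncated(logits, rng=rng)] += 1
    return [c / n_samples for c in counts]

rng = random.Random(0)
student = ssd_round([2.0, 1.0, 0.0, -3.0], rng=rng)
# The low-probability tail token (id 3) falls outside the nucleus,
# so the refit distribution assigns it exactly zero mass.
```

Even with no quality signal at all, the refit distribution is sharper than the base one: the truncated tail is pruned while the head ordering is preserved.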

Why It Might Work (Fork/Lock, Precision–Exploration)

  • Discussion focuses on the “fork vs lock” view of code:
    – “Fork” positions: many plausible next tokens (multiple solution paths).
    – “Lock” positions: only a few syntactically/semantically valid tokens.
  • Global decoding settings (temperature, truncation) force a compromise between exploration (forks) and precision (locks).
  • SSD is argued to “bake in” a better balance: sharper distributions where there’s one right token, broader where multiple are valid.
  • One analogy: sleep consolidation and synaptic pruning; the model replays its own noisy behavior, strengthening useful patterns and pruning distractor tails, even when outputs are partly gibberish.
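The fork/lock tension can be made concrete with two toy next-token distributions and a shared temperature knob (the numbers below are invented for illustration): one global temperature cannot sharpen the lock without also flattening away the fork's diversity, and vice versa.

```python
import math

def softmax_t(logits, temperature):
    """Softmax with temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more exploration."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Invented example positions:
lock = [6.0, 0.0, 0.0, 0.0]   # one clearly right token (e.g. a closing paren)
fork = [1.0, 0.9, 0.8, 0.7]   # several comparably plausible continuations

# At T = 1.0 the lock still leaks some probability mass to wrong tokens...
lock_err_hot = 1.0 - softmax_t(lock, 1.0)[0]
# ...and cooling to T = 0.3 fixes that, but also shrinks the fork's
# entropy, trading exploration for precision everywhere at once.
lock_err_cold = 1.0 - softmax_t(lock, 0.3)[0]
fork_H_hot = entropy(softmax_t(fork, 1.0))
fork_H_cold = entropy(softmax_t(fork, 0.3))
```

SSD's claimed advantage is that fine-tuning adjusts each position separately, so no single global knob has to make this trade.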

Relation to Self-Distillation and “Model Collapse”

  • Several note this is a specific instance of self‑distillation; related work (e.g., earlier self‑distillation fine‑tuning methods) is mentioned, and some feel that work deserved clearer positioning and credit.
  • Contrast is drawn with claims that training on model‑generated data causes “model collapse”:
    – Commenters argue collapse arises from indiscriminate, recursive reuse of outputs.
    – Targeted, on‑policy self‑distillation with controlled sampling is seen as different and potentially beneficial.

Evaluation, Benchmarks, and Limitations

  • Some are impressed by the pass@1 jump; others note the absolute score (~50%) sounds weak.
  • Explanation: hard benchmarks are intentionally calibrated so even strong models sit near 50%, making relative gains meaningful.
  • Concerns raised:
    – Possible overlap/contamination between training and test benchmarks is not clearly documented.
    – Missing baseline: the SSD‑trained model is not compared against the original model simply decoded with the same “teacher” sampling settings.
    – Risk that this mainly overfits to specific coding benchmarks without checking other capabilities.
  • One commenter notes the preprint date and treats results as promising but not settled.
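For context on those pass@1 numbers: code benchmarks typically report the unbiased pass@k estimator from the Codex paper (Chen et al., 2021). The snippet below is that standard formula, not anything specific to this preprint:

```python
from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples per problem of which
    c pass the tests, the probability that at least one of k samples
    drawn without replacement is correct."""
    if n - c < k:  # too few failures to fill k draws: a hit is guaranteed
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With 10 samples and 5 passing, pass@1 is the raw pass rate (0.5),
# while pass@5 is much higher because any single hit counts.
```

This is why a ~50% pass@1 on a deliberately hard benchmark can still represent a large relative improvement.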

Broader Reflections and Tools

  • Many highlight how small, “embarrassingly simple” tweaks can yield big gains, fitting a broader pattern in ML.
  • Discussion branches into: interpretability of LLM internals, adaptive per‑token compute/temperature, grammar‑aware decoding, and combining LLMs with deterministic tools (LSP, linters, tests).
  • Some expect a long tail of similar tricks that make strong, cheap, locally run coding models increasingly viable.
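Of those branches, grammar-aware decoding is the easiest to sketch: a parser (or LSP/linter) supplies the set of tokens that are legal next, forbidden tokens are masked to -inf before the softmax, and sampling proceeds over the rest. The helper below is a hypothetical illustration, not any specific library's API:

```python
import math
import random

def grammar_constrained_sample(logits, allowed, rng=random):
    """Mask tokens a grammar forbids at this position, renormalize,
    and sample from what remains. `allowed` is a non-empty set of
    token ids a parser deems legal next (hypothetical input)."""
    neg_inf = float("-inf")
    masked = [l if i in allowed else neg_inf for i, l in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(x - m) if x > neg_inf else 0.0 for x in masked]
    z = sum(exps)
    probs = [e / z for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return max(allowed)  # guard against floating-point rounding

rng = random.Random(0)
# Suppose only tokens 1 and 3 are grammatically valid at this position.
draws = {grammar_constrained_sample([2.0, 1.0, 0.5, 1.5], {1, 3}, rng)
         for _ in range(200)}
```

Unlike SSD, this enforces "lock" positions deterministically at decode time rather than hoping the fine-tuned distribution has learned them.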

Naming, Style, and Humor

  • Debate over the title (“Embarrassingly simple…”); some find it cringe, others note “embarrassingly” is a CS term of art (as in embarrassingly parallel).
  • “SSD” as an acronym conflicts with solid‑state drives, spawning joking alternative acronyms and meta‑humor about three‑letter acronyms in research.