Embarrassingly simple self-distillation improves code generation
Overview of the SSD Idea
- Paper proposes “simple self-distillation” (SSD) for code models:
– Sample the base model at a fixed temperature with truncated sampling (top‑k / top‑p).
– Fine‑tune the same model on its own raw outputs using standard cross‑entropy.
- No correctness checking, execution, or reward signal is used; even wrong or incoherent samples are kept.
- Reported gains on hard coding benchmarks are large, especially for mid‑sized models.
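The two‑step recipe can be illustrated with a toy categorical next‑token model standing in for the LLM. Everything here is invented for illustration (the vocabulary, logits, and sampling settings); real SSD fine‑tunes the model's weights with cross‑entropy on sampled sequences, but for a single categorical distribution the cross‑entropy minimizer is just the empirical distribution of the samples, so count re‑estimation plays the role of fine‑tuning:

```python
import math, random

random.seed(0)

# Toy stand-in for a language model: one next-token distribution over a
# tiny, invented vocabulary.
VOCAB = ["if", "return", "x", "("]
logits = {"if": 2.0, "return": 1.5, "x": 0.2, "(": 0.1}

def softmax(scores, temperature=1.0):
    exps = {t: math.exp(s / temperature) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def top_p_sample(probs, p=0.9):
    # Nucleus (top-p) truncation: keep the smallest set of tokens whose
    # cumulative probability reaches p, then sample from that set.
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    kept, total = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        total += pr
        if total >= p:
            break
    r = random.uniform(0, total)
    acc = 0.0
    for tok, pr in kept:
        acc += pr
        if acc >= r:
            return tok
    return kept[-1][0]

# Step 1: sample the "base model" with fixed temperature + top-p.
samples = [top_p_sample(softmax(logits, temperature=0.8), p=0.9)
           for _ in range(2000)]

# Step 2: "fine-tune" on the raw samples with no filtering. For a
# categorical model, maximum-likelihood (cross-entropy) training is
# re-estimating the distribution from sample counts.
counts = {t: samples.count(t) for t in VOCAB}
student = {t: counts[t] / len(samples) for t in VOCAB}

# The student has absorbed the truncated, sharpened sampling
# distribution: the low-probability tail token "(" is pruned entirely.
print(student)
```

The point of the sketch is that the student's greedy/default behavior now matches what the teacher only produced under carefully tuned decoding settings.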
Why It Might Work (Fork/Lock, Precision–Exploration)
- Discussion focuses on the “fork vs lock” view of code:
– “Fork” positions: many plausible next tokens (multiple solution paths).
– “Lock” positions: only a few syntactically/semantically valid tokens.
- Global decoding settings (temperature, truncation) force a compromise between exploration (forks) and precision (locks).
- SSD is argued to “bake in” a better balance: sharper distributions where there’s one right token, broader where multiple are valid.
- One analogy: sleep consolidation / synaptic pruning — the model replays its own noisy behavior and strengthens useful patterns while pruning distractor tails, even when outputs are partly gibberish.
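The fork/lock tension can be made concrete with hypothetical logits for the two kinds of positions (the numbers are invented): a single global temperature cannot simultaneously be sharp at a lock and exploratory at a fork.

```python
import math

def softmax(logits, temperature):
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    # Shannon entropy in nats; higher = more exploratory sampling.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for two kinds of positions in code generation:
fork = [1.0, 0.9, 0.8, 0.7]   # several plausible continuations
lock = [4.0, 0.5, 0.4, 0.3]   # essentially one valid token

for t in (0.3, 1.0, 1.5):
    print(f"T={t}: fork entropy {entropy(softmax(fork, t)):.3f}, "
          f"lock entropy {entropy(softmax(lock, t)):.3f}, "
          f"lock top-token mass {softmax(lock, t)[0]:.3f}")
```

Low temperature keeps the lock near-deterministic but collapses the fork's diversity; high temperature preserves the fork but leaks probability onto invalid tokens at the lock. SSD's claimed effect is to reshape the distributions themselves so one global setting suffices.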
Relation to Self-Distillation and “Model Collapse”
- Several note this is a specific instance of self‑distillation; related work (e.g., earlier self‑distillation fine‑tuning methods) is mentioned and some feel it deserved clearer positioning and credit.
- Contrast is drawn with claims that training on model‑generated data causes “model collapse”:
– Commenters argue collapse arises from indiscriminate, recursive reuse of outputs.
– Targeted, on‑policy self‑distillation with controlled sampling is seen as different and potentially beneficial.
Evaluation, Benchmarks, and Limitations
- Some are impressed by the pass@1 jump; others note the absolute score (~50%) sounds weak.
- Explanation: hard benchmarks are intentionally calibrated so even strong models sit near 50%, making relative gains meaningful.
- Concerns raised:
– Possible overlap/contamination between training and test benchmarks is not clearly documented.
– Missing baseline: comparing SSD‑trained model to the original model simply decoded with the same “teacher” sampling settings.
– Risk that this mainly overfits to specific coding benchmarks without checking other capabilities.
- One commenter notes the preprint date and treats results as promising but not settled.
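For readers checking the numbers: pass@1 (and pass@k generally) is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021), which averages over drawing k of the n generated samples per problem:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generated solutions,
    of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 100 correct: pass@1 = 0.5
print(pass_at_k(200, 100, 1))
```

A benchmark calibrated so strong models sit near pass@1 ≈ 0.5 maximizes headroom in both directions, which is why a ~50% absolute score is not evidence of weakness on its own.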
Broader Reflections and Tools
- Many highlight how small, “embarrassingly simple” tweaks can yield big gains, fitting a broader pattern in ML.
- Discussion branches into: interpretability of LLM internals, adaptive per‑token compute/temperature, grammar‑aware decoding, and combining LLMs with deterministic tools (LSP, linters, tests).
- Some expect a long tail of similar tricks that make strong, cheap, locally run coding models increasingly viable.
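As one illustration of pairing an LLM with a deterministic tool, grammar‑aware decoding can be sketched as a syntax checker vetoing tokens before sampling. The four‑token vocabulary and the balanced‑parentheses "grammar" below are invented stand‑ins for a real parser:

```python
import math, random

random.seed(1)

VOCAB = ["(", ")", "x", "+"]

def allowed(prefix, token):
    # Deterministic check: ")" may only appear if an "(" is still open.
    depth = prefix.count("(") - prefix.count(")")
    if token == ")":
        return depth > 0
    return True

def sample(prefix, logits, temperature=1.0):
    # Mask out grammar-invalid tokens, then sample from the survivors.
    cand = [(t, l) for t, l in zip(VOCAB, logits) if allowed(prefix, t)]
    exps = [math.exp(l / temperature) for _, l in cand]
    z = sum(exps)
    r = random.random()
    acc = 0.0
    for (tok, _), e in zip(cand, exps):
        acc += e / z
        if acc >= r:
            return tok
    return cand[-1][0]

prefix = []
for _ in range(8):
    prefix.append(sample(prefix, [0.5, 1.5, 0.2, 0.1]))
print("".join(prefix))
# every ")" in the output is matched by an earlier "("
```

This shows why such masks complement rather than replace SSD: the mask enforces hard "lock" constraints exactly, while fine‑tuning reshapes the soft probabilities.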
Naming, Style, and Humor
- Debate over the title (“Embarrassingly simple…”); some find it cringe, others note “embarrassingly” is a CS term of art (as in embarrassingly parallel).
- “SSD” as an acronym conflicts with solid‑state drives, spawning joking alternative acronyms and meta‑humor about three‑letter acronyms in research.