Embarrassingly simple self-distillation improves code generation
Overview of the SSD Idea
- Paper proposes “simple self-distillation” (SSD) for code models:
– Sample the base model at a fixed temperature with truncated sampling (top‑k / top‑p).
– Fine‑tune the same model on its own raw outputs using standard cross‑entropy.
- No correctness checking, execution, or reward signal is used; even wrong or incoherent samples are kept.
- Reported gains on hard coding benchmarks are large, especially for mid‑sized models.
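The two‑step recipe can be illustrated with a toy categorical next‑token model standing in for the LLM. Everything here is invented for illustration (the vocabulary, logits, and sampling settings); real SSD fine‑tunes the model's weights with cross‑entropy on sampled sequences, but for a single categorical distribution the cross‑entropy minimizer is just the empirical distribution of the samples, so count re‑estimation plays the role of fine‑tuning:

```python
import math, random

random.seed(0)

# Toy stand-in for a language model: one next-token distribution over a
# tiny, invented vocabulary.
VOCAB = ["if", "return", "x", "("]
logits = {"if": 2.0, "return": 1.5, "x": 0.2, "(": 0.1}

def softmax(scores, temperature=1.0):
    exps = {t: math.exp(s / temperature) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def top_p_sample(probs, p=0.9):
    # Nucleus (top-p) truncation: keep the smallest set of tokens whose
    # cumulative probability reaches p, then sample from that set.
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    kept, total = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        total += pr
        if total >= p:
            break
    r = random.uniform(0, total)
    acc = 0.0
    for tok, pr in kept:
        acc += pr
        if acc >= r:
            return tok
    return kept[-1][0]

# Step 1: sample the "base model" with fixed temperature + top-p.
samples = [top_p_sample(softmax(logits, temperature=0.8), p=0.9)
           for _ in range(2000)]

# Step 2: "fine-tune" on the raw samples with no filtering. For a
# categorical model, maximum-likelihood (cross-entropy) training is
# re-estimating the distribution from sample counts.
counts = {t: samples.count(t) for t in VOCAB}
student = {t: counts[t] / len(samples) for t in VOCAB}

# The student has absorbed the truncated, sharpened sampling
# distribution: the low-probability tail token "(" is pruned entirely.
print(student)
```

The point of the sketch is that the student's greedy/default behavior now matches what the teacher only produced under carefully tuned decoding settings.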
Why It Might Work (Fork/Lock, Precision–Exploration)
- Discussion focuses on the “fork vs lock” view of code:
– “Fork” positions: many plausible next tokens (multiple solution paths).
– “Lock” positions: only a few syntactically/semantically valid tokens.
- Global decoding settings (temperature, truncation) force a compromise between exploration (forks) and precision (locks).
- SSD is argued to “bake in” a better balance: sharper distributions where there’s one right token, broader where multiple are valid.
- One analogy: sleep consolidation / synaptic pruning — the model replays its own noisy behavior and strengthens useful patterns while pruning distractor tails, even when outputs are partly gibberish.
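The fork/lock tension can be made concrete with hypothetical logits for the two kinds of positions (the numbers are invented): a single global temperature cannot simultaneously be sharp at a lock and exploratory at a fork.

```python
import math

def softmax(logits, temperature):
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    # Shannon entropy in nats; higher = more exploratory sampling.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for two kinds of positions in code generation:
fork = [1.0, 0.9, 0.8, 0.7]   # several plausible continuations
lock = [4.0, 0.5, 0.4, 0.3]   # essentially one valid token

for t in (0.3, 1.0, 1.5):
    print(f"T={t}: fork entropy {entropy(softmax(fork, t)):.3f}, "
          f"lock entropy {entropy(softmax(lock, t)):.3f}, "
          f"lock top-token mass {softmax(lock, t)[0]:.3f}")
```

Low temperature keeps the lock near-deterministic but collapses the fork's diversity; high temperature preserves the fork but leaks probability onto invalid tokens at the lock. SSD's claimed effect is to reshape the distributions themselves so one global setting suffices.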
Relation to Self-Distillation and “Model Collapse”
- Several note this is a specific instance of self‑distillation; related work (e.g., earlier self‑distillation fine‑tuning methods) is mentioned and some feel it deserved clearer positioning and credit.
- Contrast is drawn with claims that training on model‑generated data causes “model collapse”:
– Commenters argue collapse arises from indiscriminate, recursive reuse of outputs.
– Targeted, on‑policy self‑distillation with controlled sampling is seen as different and potentially beneficial.
Evaluation, Benchmarks, and Limitations
- Some are impressed by the pass@1 jump; others note the absolute score (~50%) sounds weak.
- Explanation: hard benchmarks are intentionally calibrated so even strong models sit near 50%, making relative gains meaningful.
- Concerns raised:
– Possible overlap/contamination between training and test benchmarks is not clearly documented.
– Missing baseline: comparing SSD‑trained model to the original model simply decoded with the same “teacher” sampling settings.
– Risk that this mainly overfits to specific coding benchmarks without checking other capabilities.
- One commenter notes the preprint date and treats results as promising but not settled.
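For readers checking the numbers: pass@1 (and pass@k generally) is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021), which averages over drawing k of the n generated samples per problem:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generated solutions,
    of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 100 correct: pass@1 = 0.5
print(pass_at_k(200, 100, 1))
```

A benchmark calibrated so strong models sit near pass@1 ≈ 0.5 maximizes headroom in both directions, which is why a ~50% absolute score is not evidence of weakness on its own.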
Broader Reflections and Tools
- Many highlight how small, “embarrassingly simple” tweaks can yield big gains, fitting a broader pattern in ML.
- Discussion branches into: interpretability of LLM internals, adaptive per‑token compute/temperature, grammar‑aware decoding, and combining LLMs with deterministic tools (LSP, linters, tests).
- Some expect a long tail of similar tricks that make strong, cheap, locally run coding models increasingly viable.
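As one illustration of pairing an LLM with a deterministic tool, grammar‑aware decoding can be sketched as a syntax checker vetoing tokens before sampling. The four‑token vocabulary and the balanced‑parentheses "grammar" below are invented stand‑ins for a real parser:

```python
import math, random

random.seed(1)

VOCAB = ["(", ")", "x", "+"]

def allowed(prefix, token):
    # Deterministic check: ")" may only appear if an "(" is still open.
    depth = prefix.count("(") - prefix.count(")")
    if token == ")":
        return depth > 0
    return True

def sample(prefix, logits, temperature=1.0):
    # Mask out grammar-invalid tokens, then sample from the survivors.
    cand = [(t, l) for t, l in zip(VOCAB, logits) if allowed(prefix, t)]
    exps = [math.exp(l / temperature) for _, l in cand]
    z = sum(exps)
    r = random.random()
    acc = 0.0
    for (tok, _), e in zip(cand, exps):
        acc += e / z
        if acc >= r:
            return tok
    return cand[-1][0]

prefix = []
for _ in range(8):
    prefix.append(sample(prefix, [0.5, 1.5, 0.2, 0.1]))
print("".join(prefix))
# every ")" in the output is matched by an earlier "("
```

This shows why such masks complement rather than replace SSD: the mask enforces hard "lock" constraints exactly, while fine‑tuning reshapes the soft probabilities.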
Naming, Style, and Humor
- Debate over the title (“Embarrassingly simple…”); some find it cringe, others note “embarrassingly” is a CS term of art (as in embarrassingly parallel).
- “SSD” as an acronym conflicts with solid‑state drives, spawning joking alternative acronyms and meta‑humor about three‑letter acronyms in research.