BERT is just a single text diffusion step
Connection between BERT/MLM and diffusion
- Many commenters like the framing that masked language modeling (MLM) is essentially a single denoising step of a diffusion process (a minimal sketch of this equivalence follows this list).
- Several note this connection has been made before in papers on text diffusion and generative MLMs; the post is praised more for its clarity and simplicity than for novelty.
- Some argue the “is this diffusion or MLM?” taxonomy is unhelpful; what matters is whether the procedure works, not the label.
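To make the framing concrete, here is a minimal PyTorch sketch (not from the thread; `model` is assumed to be a HuggingFace-style masked LM whose forward pass returns `.logits`). BERT's MLM objective is the special case where the mask ratio is pinned near 0.15, while masked text diffusion samples the ratio per example so the model learns to denoise at every corruption level.

```python
import torch
import torch.nn.functional as F

def masked_denoising_step(model, input_ids, mask_token_id, mask_ratio):
    """One training step of masked (absorbing-state) text diffusion.

    With mask_ratio fixed at ~0.15 this is essentially BERT's MLM objective;
    diffusion training instead samples mask_ratio uniformly in (0, 1].
    """
    labels = input_ids.clone()
    # Forward/corruption process: replace a random subset of tokens with [MASK].
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    corrupted = input_ids.masked_fill(mask, mask_token_id)
    # Reverse/denoising step: predict the originals, scoring only masked positions.
    logits = model(corrupted).logits
    return F.cross_entropy(logits[mask], labels[mask])

# Diffusion-style training: mask_ratio = torch.rand(()).clamp_min(1e-3).item()
# BERT-style MLM:           mask_ratio = 0.15
```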
Noise, corruption, and token semantics
- A key distinction raised: continuous diffusion adds smooth noise, whereas in text you must corrupt discrete symbols.
- Simple random corruption (e.g., replacing random bytes or tokens) is easy but may not teach robustness to realistic model “mistakes,” which are usually semantically related errors.
- Several commenters and cited papers have tried semantic corruption (e.g., “quick brown fox” → “speedy black dog”), but masking often turned out easier for models to invert; random and mask corruption are contrasted in the sketch after this list.
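A toy illustration of the two simple corruption styles being contrasted (plain Python; the vocabulary, ratios, and mask symbol are made up for the example):

```python
import random

def random_token_corruption(tokens, vocab, ratio=0.3):
    """Replace a fraction of tokens with uniformly random ones.
    Cheap, but rarely resembles the plausible mistakes a model actually makes."""
    return [random.choice(vocab) if random.random() < ratio else t for t in tokens]

def mask_corruption(tokens, ratio=0.3, mask="[MASK]"):
    """Replace a fraction of tokens with an absorbing [MASK] symbol,
    the corruption that models reportedly learn to invert most easily."""
    return [mask if random.random() < ratio else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = ["speedy", "black", "dog", "cat", "runs", "under"]  # toy vocabulary
print(random_token_corruption(tokens, vocab))
print(mask_corruption(tokens))
```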
Diffusion vs autoregressive LLMs and human cognition
- One camp feels that diffusion-style iterative refinement is more “brain-like” than token-by-token generation, matching their personal experience of drafting and revising.
- Others push back: humans still emit words sequentially; internal planning and revision happen in a latent, higher-level space, not literally as word-level diffusion.
- A long subthread debates whether autoregressive models “plan ahead.” Cited interpretability work suggests they maintain latent features anticipating future rhymes or structure.
- There is disagreement over whether re-evaluating the context at each new token (with a KV cache) counts as genuine planning or merely “starting anew with memory.”
Editing, backtracking, and code applications
- Diffusion-style models naturally support in-place editing: masking regions to refine or correct them instead of only appending tokens.
- This is seen as especially promising for code editing and inline completion, where you want to revise existing text, not just extend it.
- Commenters note that diffusion can already reintroduce noise and delete tokens; ideas include logprob-based masking schedules (sketched after this list) and explicit expand/delete tokens for Levenshtein-like edits.
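One of the suggested directions, a logprob-based remasking schedule, might look roughly like the sketch below. This is an assumption-laden illustration rather than anything from the thread: `model` is again a hypothetical HuggingFace-style masked LM returning `.logits`, and the greedy fill and decaying remask schedule are arbitrary choices.

```python
import torch

@torch.no_grad()
def refine(model, ids, mask_id, steps=8, remask_frac=0.25):
    """Iteratively fill masks, then remask the least confident tokens.

    ids: LongTensor of shape (seq_len,), possibly containing mask_id entries.
    """
    for step in range(steps):
        logits = model(ids.unsqueeze(0)).logits.squeeze(0)   # (seq_len, vocab)
        logprobs = logits.log_softmax(-1)
        filled = logprobs.argmax(-1)                          # greedy fill
        ids = torch.where(ids == mask_id, filled, ids)        # edit in place
        # Reintroduce "noise": remask the tokens the model is least sure of,
        # remasking fewer of them as refinement proceeds.
        conf = logprobs.gather(-1, ids.unsqueeze(-1)).squeeze(-1)
        k = max(1, int(remask_frac * ids.numel() * (1 - step / steps)))
        ids[torch.topk(conf, k, largest=False).indices] = mask_id
    # Final pass fills whatever is still masked.
    logits = model(ids.unsqueeze(0)).logits.squeeze(0)
    return torch.where(ids == mask_id, logits.argmax(-1), ids)
```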
Design challenges and open directions
- Discrete tokens force continuous diffusion into embedding space, making training more complex than pixel-level image diffusion (see the sketch at the end of this section).
- People are interested in:
  - Starting from random tokens vs full masks.
  - Hybrid models combining continuous latent diffusion with autoregressive transformers.
  - Comparisons with ELECTRA/DeBERTa and the availability of open text-diffusion base models for fine-tuning, especially on code.
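On the embedding-space point above: Gaussian noise is well defined on continuous embeddings but not on token ids, which is roughly where the extra complexity comes from. A toy illustration with made-up dimensions and a deliberately simplified noise schedule:

```python
import torch

emb = torch.nn.Embedding(50_000, 512)            # toy vocabulary / embedding size
x0 = emb(torch.randint(0, 50_000, (1, 64)))      # clean token embeddings
t = 0.3                                          # corruption level in [0, 1]
xt = (1 - t) ** 0.5 * x0 + t ** 0.5 * torch.randn_like(x0)
# The part with no pixel-space analogue: mapping the denoised xt back to
# discrete tokens (rounding, nearest-neighbour lookup, or a learned decoder).
```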