BERT is just a single text diffusion step
Connection between BERT/MLM and diffusion
- Many commenters like the framing that masked language modeling (MLM) is essentially a single denoising step of a diffusion process (a minimal sketch of this equivalence follows this list).
- Several note this connection has been made before in papers on text diffusion and generative MLMs; the post is praised more for its clarity and simplicity than for novelty.
- Some argue the “is this diffusion or MLM?” taxonomy is unhelpful; what matters is whether the procedure works, not the label.
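To make the framing concrete, here is a minimal PyTorch sketch (not from the thread; `model` is assumed to be a HuggingFace-style masked LM whose forward pass returns `.logits`). BERT's MLM objective is the special case where the mask ratio is pinned near 0.15, while masked text diffusion samples the ratio per example so the model learns to denoise at every corruption level.

```python
import torch
import torch.nn.functional as F

def masked_denoising_step(model, input_ids, mask_token_id, mask_ratio):
    """One training step of masked (absorbing-state) text diffusion.

    With mask_ratio fixed at ~0.15 this is essentially BERT's MLM objective;
    diffusion training instead samples mask_ratio uniformly in (0, 1].
    """
    labels = input_ids.clone()
    # Forward/corruption process: replace a random subset of tokens with [MASK].
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    corrupted = input_ids.masked_fill(mask, mask_token_id)
    # Reverse/denoising step: predict the originals, scoring only masked positions.
    logits = model(corrupted).logits
    return F.cross_entropy(logits[mask], labels[mask])

# Diffusion-style training: mask_ratio = torch.rand(()).clamp_min(1e-3).item()
# BERT-style MLM:           mask_ratio = 0.15
```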
Noise, corruption, and token semantics
- A key distinction raised: continuous diffusion adds smooth noise, whereas in text you must corrupt discrete symbols.
- Simple random corruption (e.g., replacing random bytes or tokens) is easy but may not teach robustness to realistic model “mistakes,” which are usually semantically related errors.
- Several commenters and cited papers have tried semantic corruption (e.g., “quick brown fox” → “speedy black dog”), but masking often turned out easier for models to invert; random and mask corruption are contrasted in the sketch after this list.
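A toy illustration of the two simple corruption styles being contrasted (plain Python; the vocabulary, ratios, and mask symbol are made up for the example):

```python
import random

def random_token_corruption(tokens, vocab, ratio=0.3):
    """Replace a fraction of tokens with uniformly random ones.
    Cheap, but rarely resembles the plausible mistakes a model actually makes."""
    return [random.choice(vocab) if random.random() < ratio else t for t in tokens]

def mask_corruption(tokens, ratio=0.3, mask="[MASK]"):
    """Replace a fraction of tokens with an absorbing [MASK] symbol,
    the corruption that models reportedly learn to invert most easily."""
    return [mask if random.random() < ratio else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = ["speedy", "black", "dog", "cat", "runs", "under"]  # toy vocabulary
print(random_token_corruption(tokens, vocab))
print(mask_corruption(tokens))
```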
Diffusion vs autoregressive LLMs and human cognition
- One camp feels that diffusion-style iterative refinement is more “brain-like” than token-by-token generation, matching their personal experience of drafting and revising.
- Others push back: humans still emit words sequentially; internal planning and revision happen in a latent, higher-level space, not literally as word-level diffusion.
- A long subthread debates whether autoregressive models “plan ahead.” Cited interpretability work suggests they maintain latent features anticipating future rhymes or structure.
- There is disagreement over whether re-evaluating the context at each new token (with a KV cache) counts as genuine planning or merely “starting anew with memory.”
Editing, backtracking, and code applications
- Diffusion-style models naturally support in-place editing: masking regions to refine or correct them instead of only appending tokens.
- This is seen as especially promising for code editing and inline completion, where you want to revise existing text, not just extend it.
- Commenters note that diffusion can already reintroduce noise and delete tokens; ideas include logprob-based masking schedules (sketched after this list) and explicit expand/delete tokens for Levenshtein-like edits.
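One of the suggested directions, a logprob-based remasking schedule, might look roughly like the sketch below. This is an assumption-laden illustration rather than anything from the thread: `model` is again a hypothetical HuggingFace-style masked LM returning `.logits`, and the greedy fill and decaying remask schedule are arbitrary choices.

```python
import torch

@torch.no_grad()
def refine(model, ids, mask_id, steps=8, remask_frac=0.25):
    """Iteratively fill masks, then remask the least confident tokens.

    ids: LongTensor of shape (seq_len,), possibly containing mask_id entries.
    """
    for step in range(steps):
        logits = model(ids.unsqueeze(0)).logits.squeeze(0)   # (seq_len, vocab)
        logprobs = logits.log_softmax(-1)
        filled = logprobs.argmax(-1)                          # greedy fill
        ids = torch.where(ids == mask_id, filled, ids)        # edit in place
        # Reintroduce "noise": remask the tokens the model is least sure of,
        # remasking fewer of them as refinement proceeds.
        conf = logprobs.gather(-1, ids.unsqueeze(-1)).squeeze(-1)
        k = max(1, int(remask_frac * ids.numel() * (1 - step / steps)))
        ids[torch.topk(conf, k, largest=False).indices] = mask_id
    # Final pass fills whatever is still masked.
    logits = model(ids.unsqueeze(0)).logits.squeeze(0)
    return torch.where(ids == mask_id, logits.argmax(-1), ids)
```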
Design challenges and open directions
- Discrete tokens force continuous diffusion into embedding space, making training more complex than pixel-level image diffusion (see the sketch at the end of this section).
- People are interested in:
  - Starting from random tokens vs full masks.
  - Hybrid models combining continuous latent diffusion with autoregressive transformers.
  - Comparisons with ELECTRA/DeBERTa and the availability of open text-diffusion base models for fine-tuning, especially on code.
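On the embedding-space point above: Gaussian noise is well defined on continuous embeddings but not on token ids, which is roughly where the extra complexity comes from. A toy illustration with made-up dimensions and a deliberately simplified noise schedule:

```python
import torch

emb = torch.nn.Embedding(50_000, 512)            # toy vocabulary / embedding size
x0 = emb(torch.randint(0, 50_000, (1, 64)))      # clean token embeddings
t = 0.3                                          # corruption level in [0, 1]
xt = (1 - t) ** 0.5 * x0 + t ** 0.5 * torch.randn_like(x0)
# The part with no pixel-space analogue: mapping the denoised xt back to
# discrete tokens (rounding, nearest-neighbour lookup, or a learned decoder).
```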