Ask HN: Any insider takes on Yann LeCun's push against current architectures?

Perceived Limits of Current LLM Architectures

  • Many comments restate LeCun’s core critique as: autoregressive, token-by-token generation with fixed weights leads to error accumulation and makes systematic self-correction and “global” constraint satisfaction hard.
  • Others respond that transformers are Turing-complete and, in theory, can implement arbitrary algorithms and error correction; in practice, current training and inference setups don’t realize this reliably and require task‑specific “whack‑a‑mole” fixes.
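The error-accumulation critique is usually stated as a back-of-envelope calculation: if each generated token has some independent chance of being wrong and the model never corrects itself, the probability a long output stays error-free decays geometrically. A toy sketch of that arithmetic (the independence assumption is the contested part; real models' errors are correlated and partially self-correcting):

```python
# Toy illustration of the error-accumulation argument. Assumes
# independent per-token errors, which real models violate; the point
# is only the geometric decay P(correct) = (1 - e) ** n.

def p_sequence_correct(per_token_error: float, n_tokens: int) -> float:
    """Probability an n-token autoregressive output contains no errors."""
    return (1.0 - per_token_error) ** n_tokens

for n in (10, 100, 1000):
    print(n, p_sequence_correct(0.01, n))
```

Even a 1% per-token error rate drives the sequence-level success probability toward zero for long outputs, which is why critics argue for mechanisms that can revise earlier tokens.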

Hallucinations, Uncertainty, and “I Don’t Know”

  • One camp claims transformers fundamentally lack a robust notion of uncertainty: they always pick a token, can’t “backtrack everything,” and don’t natively emit “I don’t know.”
  • Counter‑arguments:
    • Models do encode uncertainty internally, e.g. as flat (high‑entropy) next‑token distributions, and can be trained (via fine‑tuning or RL) to say “I don’t know” when they lack knowledge.
    • Research shows hidden states encode “not knowing,” but standard QA fine‑tuning suppresses that expression.
  • Several propose architectural hacks: backspace tokens, explicit confidence heads per layer, branching/beam‑like generation, or self‑reflection frameworks (e.g., SelfRAG) to decide when to retrieve or abstain.
  • Others argue hallucinations are partly desirable creativity; the real issue is calibrating when outputs are guesses vs grounded facts.
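One concrete version of the “flat distribution” point above: given access to the model’s next-token probabilities, a flat distribution has high entropy, and a simple threshold can gate an abstention. A minimal sketch (the threshold and the idea of abstaining on raw entropy are illustrative; calibrated abstention in practice needs training, per the SelfRAG-style work mentioned):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_abstain(probs, threshold=1.0):
    """Abstain ("I don't know") when the distribution is too flat."""
    return entropy(probs) > threshold

confident = [0.9, 0.05, 0.03, 0.02]   # peaked distribution: answer
uncertain = [0.25, 0.25, 0.25, 0.25]  # flat distribution: abstain
print(should_abstain(confident), should_abstain(uncertain))  # False True
```

This is the mechanism the “counter-argument” camp points to: the signal exists in the model; standard QA fine-tuning just trains the model to answer anyway.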

Energy-Based Models, World Models, and LeCun’s Focus

  • Energy-based models (EBMs) are described as assigning low “energy” to globally consistent configurations, potentially enabling better uncertainty estimates and constraint satisfaction than token‑local probabilities.
  • LeCun’s broader agenda is seen as:
    • Learning world models from rich, multimodal, interactive data (especially vision), not just text.
    • Using energy minimization / JEPA‑like objectives to move away from pure memorization.
  • Practitioners note EBMs are currently far more resource‑intensive and not yet competitive at scale, though some groups are actively trying to change this.
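The core EBM framing, stripped to a caricature: score complete candidate outputs with a scalar energy and select the global minimum, instead of committing token-by-token. The energy function below is a made-up stand-in (a constraint-violation count); real EBMs learn E(x, y) from data, and doing the minimization efficiently is exactly the hard, resource-intensive part practitioners flag:

```python
# Toy energy-based selection over whole candidates. The hand-written
# energy (count of unmet constraints) is hypothetical; it only
# illustrates "low energy = globally consistent configuration".

def energy(candidate: str, required_words: set) -> float:
    """Lower energy = more of the required constraints satisfied."""
    words = set(candidate.split())
    missing = sum(1 for w in required_words if w not in words)
    return float(missing)

def pick(candidates, required_words):
    """Global selection: minimize energy over complete outputs."""
    return min(candidates, key=lambda c: energy(c, required_words))

cands = ["the cat sat", "the cat sat on the mat", "a dog ran"]
print(pick(cands, {"cat", "mat"}))  # -> "the cat sat on the mat"
```

The contrast with autoregressive decoding is that the score is assigned to the whole configuration at once, which is why EBM proponents argue it handles global constraints and uncertainty more naturally.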

Biological Plausibility, Efficiency, and Continual Learning

  • Many point to the brain’s ~25W energy use and continual, online learning as evidence current LLM training/inference is wildly inefficient and biologically implausible, implying large optimization headroom.
  • Others invoke the “bitter lesson”: biological plausibility isn’t necessarily a good design prior; compute‑heavy, simple methods often win.
  • Continual learning researchers say catastrophic forgetting is mostly solved in toy settings but hasn’t been pushed seriously at LLM scale; an architecture that can update itself in deployment without collapse is widely seen as necessary for longer‑term progress.
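The standard mitigation in those toy settings is a regularizer that anchors weights important to old tasks, e.g. elastic weight consolidation (EWC). A scalar-weight sketch with hand-set importance values (hypothetical numbers; real EWC estimates importance from the Fisher information):

```python
# Toy EWC-style penalty: when training on task B, drifting a weight
# that was important for task A is expensive; drifting an unimportant
# one is cheap. Weights and importances here are illustrative scalars.

def ewc_loss(new_w, old_w, importance, task_b_loss, lam=1.0):
    """task_b_loss + (lam/2) * sum_i F_i * (w_i - w*_i)**2"""
    penalty = 0.5 * lam * sum(
        f * (w - w0) ** 2 for w, w0, f in zip(new_w, old_w, importance)
    )
    return task_b_loss + penalty

old_weights = [1.0, -2.0]   # weights after task A
fisher = [10.0, 0.1]        # first weight matters for task A

print(ewc_loss([1.0, 0.0], old_weights, fisher, task_b_loss=0.5))   # cheap drift
print(ewc_loss([0.0, -2.0], old_weights, fisher, task_b_loss=0.5))  # costly drift
```

The open question the thread raises is whether anything like this survives at LLM scale and in open-ended deployment, rather than in fixed task sequences.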

Alternative Architectures and Experimental Directions

  • Mentioned directions include:
    • Diffusion language models (e.g., LLaDA/SEDD‑style) that sample whole sequences or blocks in parallel, potentially trading memory bandwidth for fewer sequential decoding steps.
    • Sentence‑level or “concept” models that operate on higher‑level units than tokens.
    • Recursive/branching “thought trees,” test‑time training, world‑model‑centric agents, and multi‑head predictive architectures like Hydra.
  • Several commenters think current transformers are a powerful but temporary step on an S‑curve; others suspect further scaling and better training schedules could still yield major surprises.
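The masked-diffusion decoding loop behind the LLaDA-style models above can be caricatured in a few lines: start from an all-mask sequence, let a model propose tokens for every masked position in parallel, commit the most confident ones, and repeat. The model below is a stub that already knows the target, so only the scheduling logic is real:

```python
import random

MASK = "_"

def fake_model(seq, target):
    """Stub predictor: (token, confidence) per position. Stands in
    for a real masked-diffusion LM's parallel predictions."""
    return [(tok, random.random()) for tok in target]

def diffusion_decode(target, steps=4, seed=0):
    """Iteratively unmask the most confident positions in parallel."""
    random.seed(seed)
    seq = [MASK] * len(target)
    per_step = max(1, len(target) // steps)
    while MASK in seq:
        preds = fake_model(seq, target)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:          # commit top-confidence tokens
            seq[i] = preds[i][0]
    return " ".join(seq)

print(diffusion_decode("energy models score whole sequences at once".split()))
```

The claimed trade-off is that each iteration touches the whole sequence (more bandwidth per step) but far fewer sequential steps are needed than one-token-at-a-time decoding.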

Economic and Social Path Dependence

  • There is broad agreement that industry incentives create strong path dependence:
    • No major lab wants to ship something that’s weaker than current leaders on benchmarks.
    • UX and integration matter more than marginal eval gains, so many promising but non‑dominant architectures (RWKV, Mamba‑like, EBMs, diffusion LMs) struggle to gain traction.
  • Overall, the thread reflects a split: some see LLMs as a dead‑end without new architectures; others view them as a flexible substrate that still has a lot of unexplored potential, with “energy minimization” more a re‑framing than a fundamentally different paradigm.