Defeating Nondeterminism in LLM Inference

Hardware & Software Sources of Nondeterminism

  • Deterministic behavior is relatively achievable on a single machine with fixed drivers and seeds, but much harder to maintain across:
    • Different GPU/TPU generations, drivers, and compiler versions that may reorder operations or change tiling.
    • Heterogeneous and multi-node clusters, where collectives and reduction operations introduce additional variance.
  • IEEE‑754 compliance helps but doesn’t guarantee identical results across implementations; floating-point addition is non-associative, so the reduction order a kernel chooses matters (see the sketch after this list).
  • Existing frameworks (e.g., PyTorch deterministic modes) mainly address run-to-run determinism with fixed batch sizes, not serving-time variability.
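
A minimal sketch (not from the thread) of the non-associativity point above: each individual IEEE-754 addition is exactly specified, but the grouping of additions is not, so a kernel that tiles or parallelizes a reduction differently can change the last bits of the result.

```python
import random

# Floating-point addition is not associative:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False: 0.6000000000000001 vs 0.6

# The same effect at reduction scale: summing identical values in a
# different order (as a different tiling or parallel split would) can
# change the final bits of the total.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
shuffled = xs[:]
random.shuffle(shuffled)
print(sum(xs) == sum(shuffled))     # usually False; the difference is tiny
```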

Batch Invariance & Large-Scale Serving

  • The core issue discussed is a lack of “batch invariance”: outputs change when the same request is served at different batch sizes or alongside different concurrent requests (illustrated in the sketch after this list).
  • vLLM-style high-throughput serving and MoE routing can make outputs depend on batch composition, even at temperature 0.
  • Some commenters note these effects are known in JAX/XLA and multi-GPU work, but appreciate the clear exposition and open-source kernels.
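
A sketch in the spirit of the article’s demonstration (shapes, dtype, and device are assumptions here; requires a CUDA GPU): the same row multiplied by the same weights can come out bitwise-different depending on how many other rows share the batch, because the backend may pick a different kernel or reduction strategy per batch size.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
others = torch.randn(63, 4096, device="cuda", dtype=torch.float16)

out_alone = x @ W                                # "request" served at batch size 1
out_batched = (torch.cat([x, others]) @ W)[:1]   # same request at batch size 64

print(torch.equal(out_alone, out_batched))            # often False on GPU backends
print((out_alone - out_batched).abs().max().item())   # small but nonzero difference
```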

Determinism vs Sampling & Probabilistic Nature

  • Several people argue “LLMs are deterministic” at the mathematical level: they output a distribution; any nondeterminism comes from:
    • Sampling (which can itself be deterministic with fixed seeds), and
    • Numeric differences in implementations.
  • Others highlight that greedy decoding (temperature 0) can hurt output quality, and that determinism does not require temp=0 as long as the sampler’s RNG is seeded (see the sketch after this list).
  • There’s debate over whether numeric nondeterminism is a real LLM problem or mainly an infra/scale artifact.
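
A minimal sketch of the RNG-control point (toy logits; all names here are assumptions, not from the thread): sampling at temperature > 0 is perfectly reproducible when the generator is seeded, so determinism does not hinge on greedy decoding.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])     # toy next-token logits
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)

def sample_token(seed: int) -> int:
    # Pinning the generator state makes temperature > 0 sampling deterministic.
    gen = torch.Generator().manual_seed(seed)
    return int(torch.multinomial(probs, num_samples=1, generator=gen))

print(sample_token(123), sample_token(123))  # same seed -> same token
print(sample_token(124))                     # a different seed may pick a different token
```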

Why Determinism Matters (and Where It Falls Short)

  • Strong support for determinism in:
    • Debugging and bug reproduction, regression tests, red teaming.
    • On-policy RL, where bitwise-identical numerics between training and inference are valuable.
    • Tool-using/agentic systems, CI checks, and validation pipelines.
    • Sharing prompts, reproducible experiments, and detecting model swaps by providers.
  • Skeptics argue that “closed-system” determinism doesn’t address:
    • Sensitivity to preceding context (which is itself part of the input).
    • Fragility to small prompt rephrasings or formatting changes.
    • The deeper need for semantic consistency across semantically equivalent prompts.

Philosophical & Meta Discussion

  • Multiple threads contrast:
    • Determinism vs ambiguity (language is inherently ambiguous, but a deterministic mapping from exact input tokens to output tokens is still useful).
    • Reproducibility (bitwise identical) vs replicability (similar behavior under slightly varied conditions), with some saying the latter matters more.
  • Mixed views on the article and company:
    • Some see it as solid engineering craft and a promising sign.
    • Others think it’s well-known territory and modest output for a heavily funded startup.