Defeating Nondeterminism in LLM Inference
Hardware & Software Sources of Nondeterminism
- Deterministic behavior is relatively achievable on a single machine with fixed drivers and seeds, but much harder to achieve across:
  - Different GPU/TPU generations, drivers, and compiler versions, which may reorder operations or change tiling.
  - Heterogeneous and multi-node clusters, where collectives and reduction operations introduce additional variance.
- IEEE‑754 specifies individual operations exactly but doesn't guarantee identical end-to-end behavior: floating-point addition is non-associative, so the order in which a kernel accumulates partial sums changes the result, and kernel details therefore matter (see the two-line sketch after this list).
- Existing frameworks (e.g., PyTorch's deterministic modes, sketched below) mainly address run-to-run determinism at a fixed batch size, not the batch-to-batch variability seen in serving.
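A two-line illustration of the non-associativity point, in plain Python (no GPU needed): the same four terms summed in two orders give different answers.

```python
# Floating-point addition is non-associative: 1.0 added to 1e16 is absorbed,
# because 1.0 is below the spacing between adjacent doubles at that magnitude,
# so the grouping of the terms determines the final answer.
a = (1e16 + 1.0 - 1e16) + 1.0  # the +1.0 is lost inside the large sum -> 1.0
b = (1e16 - 1e16) + 1.0 + 1.0  # same terms, different order -> 2.0
print(a, b)  # 1.0 2.0
```

Any kernel change that reorders a reduction (different tiling, different thread counts, split-K matmuls) can shift results in exactly this way.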
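For reference, a minimal sketch of the run-to-run determinism knobs PyTorch provides (the exact set needed varies by version and by which ops a model uses):

```python
import os
import torch

# Pin RNG state and force deterministic kernel implementations. These make
# repeated runs with the *same* shapes and batch sizes bitwise-identical,
# but do not make outputs invariant to batch size.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic CUDA GEMMs
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False  # disable autotuning, which can vary kernel choice across runs
```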
Batch Invariance & Large-Scale Serving
- The core issue discussed is the lack of "batch invariance": the same request can produce different outputs depending on the batch size it is served at or on which other requests are batched alongside it.
- vLLM-style high-throughput serving and MoE routing can make outputs depend on batch composition, even at temperature 0 (a minimal sketch follows this list).
- Some commenters note these effects are known in JAX/XLA and multi-GPU work, but appreciate the clear exposition and open-source kernels.
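A minimal PyTorch sketch of the batch-size effect (illustrative, not the article's kernels): the same row pushed through the same matmul at two batch sizes can come back bitwise-different, because different shapes select different kernels and tilings with different reduction orders.

```python
import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

W = torch.randn(4096, 4096, device=device)
x = torch.randn(32, 4096, device=device)

alone = x[:1] @ W      # row 0 at batch size 1
batched = (x @ W)[:1]  # row 0 computed inside a batch of 32

# On many GPUs this prints False with a small nonzero max difference;
# the logical computation is identical, only the batch shape changed.
print(torch.equal(alone, batched), (alone - batched).abs().max().item())
```

Greedy decoding then amplifies any such difference: one flipped logit comparison changes a token, and the divergence compounds over the rest of the generation.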
Determinism vs Sampling & Probabilistic Nature
- Several commenters argue that LLMs are deterministic at the mathematical level: the model maps an input sequence to a distribution over next tokens, and any nondeterminism comes from:
  - Sampling (which can itself be made deterministic with a fixed seed; see the sketch after this list), and
  - Numeric differences between implementations.
- Others note that greedy decoding (temperature 0) can harm output quality, and that determinism does not require temperature 0 if the sampler's RNG is seeded.
- There’s debate over whether numeric nondeterminism is a real LLM problem or mainly an infra/scale artifact.
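A small sketch of seeded sampling (illustrative helper, not from the article), showing that temperature above 0 and determinism are compatible:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float, seed: int) -> int:
    # A seeded generator makes temperature sampling repeatable.
    gen = torch.Generator().manual_seed(seed)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=gen).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
# Same logits + same seed -> same token, even at temperature 0.8.
assert sample_next_token(logits, 0.8, seed=42) == sample_next_token(logits, 0.8, seed=42)
```

The caveat, and the article's point, is the precondition: the logits themselves must be bitwise-identical across runs, which serving-time batching can break.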
Why Determinism Matters (and Where It Falls Short)
- Strong support for determinism in:
  - Debugging and bug reproduction, regression tests, and red teaming.
  - On-policy RL, where bitwise-identical training and inference numerics are valuable.
  - Tool-using/agentic systems, CI checks, and validation pipelines (a minimal regression-test sketch follows this list).
  - Sharing prompts, reproducing experiments, and detecting silent model swaps by providers.
- Skeptics argue that "closed-system" determinism doesn't address:
  - Sensitivity to the preceding context (which is itself part of the input).
  - Fragility to small prompt rephrasings or formatting changes.
  - The deeper need for semantic consistency across semantically equivalent prompts.
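One way such determinism gets used in practice, as a hedged sketch (assumes a HuggingFace-style model/tokenizer; names are illustrative): pin a greedy decode in CI by hashing its token IDs.

```python
import hashlib
import torch

def greedy_decode_hash(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    # Hash the greedy continuation's token IDs for exact-match regression tests.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    return hashlib.sha256(out[0].cpu().numpy().tobytes()).hexdigest()

# In CI, compare against a hash recorded on the same hardware/software stack:
# assert greedy_decode_hash(model, tok, "2+2=") == EXPECTED_HASH
```

The recorded hash is only stable on an identical stack (and, absent batch-invariant kernels, an identical batch size), which is exactly the gap the article is about.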
Philosophical & Meta Discussion
- Multiple threads contrast:
  - Determinism vs. ambiguity: language is inherently ambiguous, but a deterministic mapping from exact input tokens to output tokens is still useful.
  - Reproducibility (bitwise-identical outputs) vs. replicability (similar behavior under slightly varied conditions), with some saying the latter matters more.
- Mixed views on the article and the company:
  - Some see it as solid engineering craft and a promising early sign.
  - Others consider it well-known territory and a modest output for a heavily funded startup.