Defeating Nondeterminism in LLM Inference

Hardware & Software Sources of Nondeterminism

  • Deterministic behavior is relatively achievable on a single machine with fixed drivers and seeds, but much harder to maintain across:
    • Different GPU/TPU generations, drivers, and compiler versions that may reorder operations or change tiling.
    • Heterogeneous and multi-node clusters, where collectives and reduction operations introduce additional variance.
  • IEEE‑754 compliance helps but doesn’t guarantee identical results across implementations; floating-point addition is non-associative, so the reduction order a kernel chooses matters (see the sketch after this list).
  • Existing frameworks (e.g., PyTorch deterministic modes) mainly address run-to-run determinism with fixed batch sizes, not serving-time variability.
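
A minimal sketch (not from the thread) of the non-associativity point above: each individual IEEE-754 addition is exactly specified, but the grouping of additions is not, so a kernel that tiles or parallelizes a reduction differently can change the last bits of the result.

```python
import random

# Floating-point addition is not associative:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False: 0.6000000000000001 vs 0.6

# The same effect at reduction scale: summing identical values in a
# different order (as a different tiling or parallel split would) can
# change the final bits of the total.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
shuffled = xs[:]
random.shuffle(shuffled)
print(sum(xs) == sum(shuffled))     # usually False; the difference is tiny
```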

Batch Invariance & Large-Scale Serving

  • The core issue discussed is a lack of “batch invariance”: outputs change when the same request is served at different batch sizes or alongside different concurrent requests (illustrated in the sketch after this list).
  • vLLM-style high-throughput serving and MoE routing can make outputs depend on batch composition, even at temperature 0.
  • Some commenters note these effects are known in JAX/XLA and multi-GPU work, but appreciate the clear exposition and open-source kernels.
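
A sketch in the spirit of the article’s demonstration (shapes, dtype, and device are assumptions here; requires a CUDA GPU): the same row multiplied by the same weights can come out bitwise-different depending on how many other rows share the batch, because the backend may pick a different kernel or reduction strategy per batch size.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
others = torch.randn(63, 4096, device="cuda", dtype=torch.float16)

out_alone = x @ W                                # "request" served at batch size 1
out_batched = (torch.cat([x, others]) @ W)[:1]   # same request at batch size 64

print(torch.equal(out_alone, out_batched))            # often False on GPU backends
print((out_alone - out_batched).abs().max().item())   # small but nonzero difference
```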

Determinism vs Sampling & Probabilistic Nature

  • Several people argue “LLMs are deterministic” at the mathematical level: they output a distribution; any nondeterminism comes from:
    • Sampling (which can itself be deterministic with fixed seeds), and
    • Numeric differences in implementations.
  • Others highlight that greedy decoding (temperature 0) can hurt output quality, and that determinism does not require temp=0 as long as the sampler’s RNG is seeded (see the sketch after this list).
  • There’s debate over whether numeric nondeterminism is a real LLM problem or mainly an infra/scale artifact.
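
A minimal sketch of the RNG-control point (toy logits; all names here are assumptions, not from the thread): sampling at temperature > 0 is perfectly reproducible when the generator is seeded, so determinism does not hinge on greedy decoding.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])     # toy next-token logits
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)

def sample_token(seed: int) -> int:
    # Pinning the generator state makes temperature > 0 sampling deterministic.
    gen = torch.Generator().manual_seed(seed)
    return int(torch.multinomial(probs, num_samples=1, generator=gen))

print(sample_token(123), sample_token(123))  # same seed -> same token
print(sample_token(124))                     # a different seed may pick a different token
```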

Why Determinism Matters (and Where It Falls Short)

  • Strong support for determinism in:
    • Debugging and bug reproduction, regression tests, red teaming.
    • On-policy RL, where bitwise-identical numerics between training and inference are valuable.
    • Tool-using/agentic systems, CI checks, and validation pipelines.
    • Sharing prompts, reproducible experiments, and detecting model swaps by providers.
  • Skeptics argue that “closed-system” determinism doesn’t address:
    • Sensitivity to preceding context (which is itself part of the input).
    • Fragility to small prompt rephrasings or formatting changes.
    • The deeper need for semantic consistency across semantically equivalent prompts.

Philosophical & Meta Discussion

  • Multiple threads contrast:
    • Determinism vs ambiguity (language is inherently ambiguous, but a deterministic mapping from exact input tokens to output tokens is still useful).
    • Reproducibility (bitwise identical) vs replicability (similar behavior under slightly varied conditions), with some saying the latter matters more.
  • Mixed views on the article and company:
    • Some see it as solid engineering craft and a promising sign.
    • Others think it’s well-known territory and modest output for a heavily funded startup.