Systems Correctness Practices at Amazon Web Services

Use of TLA+ and Formal Methods

  • Several commenters describe practical wins from TLA+ beyond “big distributed systems”: games (Snake), queues, embedded systems, and modeling misbehaving hardware.
  • Key idea: model the system as state transitions with invariants (e.g., “snake length ≥ 2”). Model checking then explores executions to find violating traces that are very unlikely to appear in tests (see the sketch after this list).
  • Some clarify that TLA+ can be used purely as a proof system (not just model checking) and that proofs cover all behaviors, including infinite ones beyond what model checking can exhaustively explore.
  • Others stress limits: you only verify what you specify; there are gaps between model and implementation and between real needs and the properties you think to state.
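
Not TLA+ itself, but a minimal Python sketch of the idea in the bullets above: enumerate state transitions with a breadth-first search, check an invariant on every reachable state, and report the trace that violates it. The toy lost-update model and its invariant are hypothetical, chosen only to show the shape of explicit-state model checking, not anything from the article.

    from collections import deque

    # Toy model: two "workers" each increment a shared counter once without
    # locking (read, then write back). Hypothetical invariant: once both
    # workers finish, the counter equals 2 -- the classic lost-update bug.

    def initial_state():
        # (shared counter, per-worker program counter, per-worker local copy)
        return (0, (0, 0), (None, None))

    def _set(tup, i, v):
        lst = list(tup)
        lst[i] = v
        return tuple(lst)

    def next_states(state):
        counter, pcs, locals_ = state
        for w in (0, 1):
            if pcs[w] == 0:   # step 1: read the shared counter into a local
                yield (counter, _set(pcs, w, 1), _set(locals_, w, counter))
            elif pcs[w] == 1: # step 2: write back local + 1
                yield (locals_[w] + 1, _set(pcs, w, 2), locals_)

    def invariant(state):
        counter, pcs, _ = state
        return not (pcs == (2, 2) and counter != 2)

    def check():
        # Breadth-first search over every reachable state, remembering parents
        # so a violating trace can be reconstructed -- the core of what a
        # model checker like TLC does, at vastly larger scale.
        init = initial_state()
        parent, frontier = {init: None}, deque([init])
        while frontier:
            state = frontier.popleft()
            if not invariant(state):
                trace = []
                while state is not None:
                    trace.append(state)
                    state = parent[state]
                return list(reversed(trace))
            for nxt in next_states(state):
                if nxt not in parent:
                    parent[nxt] = state
                    frontier.append(nxt)
        return None

    if __name__ == "__main__":
        for step in check() or []:
            print(step)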

Deterministic Simulation Testing (DST)

  • Deterministic simulation of distributed systems is widely praised as “amazing.” AWS’s approach—single-threaded simulation controlling scheduling, timing, and network—is compared to Loom (Rust), TigerBeetle’s simulator, FoundationDB, Antithesis, Keploy, Coyote (.NET), Java tools, Haskell’s Dejafu, rr, and past projects like Hermit and Corensic.
  • There is debate about feasibility: exhaustively exploring all orderings is impossible for nontrivial systems, but capturing and replaying specific bad orderings is highly valuable (see the sketch after this list).
  • Retrofitting determinism onto arbitrary software is seen as hard; tight coupling to frameworks or runtimes has historically limited adoption.
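
None of the tools above works exactly this way, but a minimal Python sketch shows the core trick: a single-threaded simulator whose one seeded RNG makes every scheduling, ordering, and message-drop decision, so any failing run replays exactly from its seed. The two-node write/ack protocol and the property being checked are hypothetical.

    import random

    class Node:
        """A trivial participant in a hypothetical write/ack protocol."""
        def __init__(self, name):
            self.name, self.acked = name, 0

        def on_message(self, src, payload, net):
            if payload == "write":
                net.send(self.name, src, "ack")  # reply through the simulated network
            elif payload == "ack":
                self.acked += 1

    class Simulation:
        """Single-threaded simulator: one seeded RNG decides all scheduling,
        message ordering, and drops, so a failure replays from its seed."""
        def __init__(self, seed, drop_rate=0.1):
            self.rng = random.Random(seed)
            self.drop_rate = drop_rate
            self.nodes = {"client": Node("client"), "replica": Node("replica")}
            self.in_flight = []  # (src, dst, payload)

        def send(self, src, dst, payload):
            if self.rng.random() >= self.drop_rate:  # simulated lossy network
                self.in_flight.append((src, dst, payload))

        def run(self, writes=5):
            for _ in range(writes):
                self.send("client", "replica", "write")
            while self.in_flight:
                # The simulator, not the OS scheduler, picks the next delivery.
                i = self.rng.randrange(len(self.in_flight))
                src, dst, payload = self.in_flight.pop(i)
                self.nodes[dst].on_message(src, payload, self)
            return self.nodes["client"].acked

    if __name__ == "__main__":
        for seed in range(1000):
            acked = Simulation(seed).run()
            if acked != 5:  # hypothetical check: every write should be acked
                print(f"seed {seed}: only {acked}/5 acks; Simulation({seed}).run() replays it")
                break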

Error Handling and Catastrophic Failures

  • The “92% of catastrophic failures from mishandled nonfatal errors” statistic strongly resonates. Many argue that error paths get far less design and testing attention than happy paths.
  • Best practices discussed: treat errors as first-class, use precise error types/status codes, design for recovery semantics (retries, dead-letter queues, fallbacks), and avoid turning fatal errors into silent nulls (a minimal sketch follows this list).
  • Distributed systems complicate “just crash”: crashes can cause restart loops or inconsistent state unless failure handling is carefully modeled.
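
A minimal Python sketch of the “errors as first-class” advice above: distinguish retryable from fatal errors with explicit types, bound the retries, re-raise what cannot be recovered, and route exhausted work to a dead-letter queue rather than dropping it. The exception names and process_record are hypothetical.

    import time

    class RetryableError(Exception):
        """Transient failure (timeout, throttle); safe to retry."""

    class FatalError(Exception):
        """Bug or bad input; retrying will not help."""

    def process_record(record):
        # Hypothetical business logic that may raise either error type.
        raise RetryableError("downstream timeout")

    def handle(record, dead_letter_queue, max_attempts=3, backoff_s=0.1):
        for attempt in range(1, max_attempts + 1):
            try:
                return process_record(record)
            except RetryableError as err:
                if attempt == max_attempts:
                    # Recovery semantics: park the work where it can be
                    # replayed instead of silently discarding the failure.
                    dead_letter_queue.append((record, repr(err)))
                    return None
                time.sleep(backoff_s * attempt)
            except FatalError:
                # Do not mask fatal errors as "no result": surface them loudly.
                raise

    if __name__ == "__main__":
        dlq = []
        handle({"id": 1}, dlq)
        print("dead-lettered:", dlq)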

Accessibility, Tooling, and Adoption

  • Some readers want simple “hello world” examples of TLA+ or P in the article; otherwise the techniques feel like heavy overhead vs. “good design and testing.” Others reply that testing can never cover the state space these tools can.
  • AI tools (e.g., large language models) are reported to help generate TLA+ specs from existing code and find real bugs, sparking speculation that AI could greatly accelerate rigorous testing and formal methods.

P, Coyote, and Control Planes

  • P and its successors (P#, Coyote) are discussed as being used both for abstract model checking and for testing real production services, especially state-machine-based control planes.
  • Some question whether generating production code from P is still done; current emphasis seems more on testing than codegen.
  • There’s criticism that building P/Coyote on C# reduces approachability compared to Go/Java ecosystems, although the underlying goal—making formal methods more usable—is applauded.

S3, GCS, and Engineering Impressiveness

  • S3’s long history, near-flawless operation at enormous scale, and migration to global strong read-after-write consistency are widely admired.
  • Some argue Google Cloud Storage had strong consistency earlier and is more “cleanly” engineered; others counter that S3’s scale, age, and compatibility make its evolution more impressive.

Contrasting with Other Practices

  • There is frustration that formal methods are often dismissed in industry, while practices like TDD (criticized here as lacking formal grounding and sometimes quasi-religious) gained wide adoption.
  • Property-based testing and fuzzing are generally accepted as “semi-formal”; runtime monitoring is more contentious, seen as semi-formal only when it explicitly checks specified temporal properties.
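
At the “semi-formal” end, a minimal property-based test in Python using the Hypothesis library (a tool choice assumed here, not one named in the discussion): rather than hand-picked cases, the framework searches generated inputs for a counterexample to a stated property and shrinks any failure it finds. The run-length codec and its round-trip property are hypothetical examples.

    from hypothesis import given, strategies as st

    def rle_encode(s):
        """Hypothetical run-length encoder: 'aaab' -> [('a', 3), ('b', 1)]."""
        runs = []
        for ch in s:
            if runs and runs[-1][0] == ch:
                runs[-1] = (ch, runs[-1][1] + 1)
            else:
                runs.append((ch, 1))
        return runs

    def rle_decode(runs):
        return "".join(ch * n for ch, n in runs)

    # The property acts as a lightweight specification: decoding an encoding
    # must reproduce the original string for any generated input, not just
    # the handful of examples a unit test would list.
    @given(st.text())
    def test_round_trip(s):
        assert rle_decode(rle_encode(s)) == s

    if __name__ == "__main__":
        test_round_trip()  # Hypothesis runs many generated cases and shrinks failures
        print("round-trip property held on generated inputs")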