Systems Correctness Practices at Amazon Web Services

Use of TLA+ and Formal Methods

  • Several commenters describe practical wins from TLA+ beyond “big distributed systems”: games (Snake), queues, embedded systems, and modeling misbehaving hardware.
  • Key idea: model the system as state transitions with invariants (e.g., “snake length ≥ 2”). Model checking then explores executions to find violating traces that are very unlikely to appear in tests (see the sketch after this list).
  • Some clarify that TLA+ can be used purely as a proof system (not just model checking) and that proofs cover all behaviors, including infinite ones beyond what model checking can exhaustively explore.
  • Others stress limits: you only verify what you specify; there are gaps between model and implementation and between real needs and the properties you think to state.
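
Not TLA+ itself, but a minimal Python sketch of the idea in the bullets above: enumerate state transitions with a breadth-first search, check an invariant on every reachable state, and report the trace that violates it. The toy lost-update model and its invariant are hypothetical, chosen only to show the shape of explicit-state model checking, not anything from the article.

    from collections import deque

    # Toy model: two "workers" each increment a shared counter once without
    # locking (read, then write back). Hypothetical invariant: once both
    # workers finish, the counter equals 2 -- the classic lost-update bug.

    def initial_state():
        # (shared counter, per-worker program counter, per-worker local copy)
        return (0, (0, 0), (None, None))

    def _set(tup, i, v):
        lst = list(tup)
        lst[i] = v
        return tuple(lst)

    def next_states(state):
        counter, pcs, locals_ = state
        for w in (0, 1):
            if pcs[w] == 0:   # step 1: read the shared counter into a local
                yield (counter, _set(pcs, w, 1), _set(locals_, w, counter))
            elif pcs[w] == 1: # step 2: write back local + 1
                yield (locals_[w] + 1, _set(pcs, w, 2), locals_)

    def invariant(state):
        counter, pcs, _ = state
        return not (pcs == (2, 2) and counter != 2)

    def check():
        # Breadth-first search over every reachable state, remembering parents
        # so a violating trace can be reconstructed -- the core of what a
        # model checker like TLC does, at vastly larger scale.
        init = initial_state()
        parent, frontier = {init: None}, deque([init])
        while frontier:
            state = frontier.popleft()
            if not invariant(state):
                trace = []
                while state is not None:
                    trace.append(state)
                    state = parent[state]
                return list(reversed(trace))
            for nxt in next_states(state):
                if nxt not in parent:
                    parent[nxt] = state
                    frontier.append(nxt)
        return None

    if __name__ == "__main__":
        for step in check() or []:
            print(step)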

Deterministic Simulation Testing (DST)

  • Deterministic simulation of distributed systems is widely praised as “amazing.” AWS’s approach—single-threaded simulation controlling scheduling, timing, and network—is compared to Loom (Rust), TigerBeetle’s simulator, FoundationDB, Antithesis, Keploy, Coyote (.NET), Java tools, Haskell’s Dejafu, rr, and past projects like Hermit and Corensic.
  • There is debate about feasibility: exhaustively exploring all orderings is impossible for nontrivial systems, but capturing and replaying specific bad orderings is highly valuable (see the sketch after this list).
  • Retrofitting determinism onto arbitrary software is seen as hard; tight coupling to frameworks or runtimes has historically limited adoption.
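
None of the tools above works exactly this way, but a minimal Python sketch shows the core trick: a single-threaded simulator whose one seeded RNG makes every scheduling, ordering, and message-drop decision, so any failing run replays exactly from its seed. The two-node write/ack protocol and the property being checked are hypothetical.

    import random

    class Node:
        """A trivial participant in a hypothetical write/ack protocol."""
        def __init__(self, name):
            self.name, self.acked = name, 0

        def on_message(self, src, payload, net):
            if payload == "write":
                net.send(self.name, src, "ack")  # reply through the simulated network
            elif payload == "ack":
                self.acked += 1

    class Simulation:
        """Single-threaded simulator: one seeded RNG decides all scheduling,
        message ordering, and drops, so a failure replays from its seed."""
        def __init__(self, seed, drop_rate=0.1):
            self.rng = random.Random(seed)
            self.drop_rate = drop_rate
            self.nodes = {"client": Node("client"), "replica": Node("replica")}
            self.in_flight = []  # (src, dst, payload)

        def send(self, src, dst, payload):
            if self.rng.random() >= self.drop_rate:  # simulated lossy network
                self.in_flight.append((src, dst, payload))

        def run(self, writes=5):
            for _ in range(writes):
                self.send("client", "replica", "write")
            while self.in_flight:
                # The simulator, not the OS scheduler, picks the next delivery.
                i = self.rng.randrange(len(self.in_flight))
                src, dst, payload = self.in_flight.pop(i)
                self.nodes[dst].on_message(src, payload, self)
            return self.nodes["client"].acked

    if __name__ == "__main__":
        for seed in range(1000):
            acked = Simulation(seed).run()
            if acked != 5:  # hypothetical check: every write should be acked
                print(f"seed {seed}: only {acked}/5 acks; Simulation({seed}).run() replays it")
                break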

Error Handling and Catastrophic Failures

  • The “92% of catastrophic failures from mishandled nonfatal errors” statistic strongly resonates. Many argue that error paths get far less design and testing attention than happy paths.
  • Best practices discussed: treat errors as first-class, use precise error types/status codes, design for recovery semantics (retries, dead-letter queues, fallbacks), and avoid turning fatal errors into silent nulls (a minimal sketch follows this list).
  • Distributed systems complicate “just crash”: crashes can cause restart loops or inconsistent state unless failure handling is carefully modeled.
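
A minimal Python sketch of the “errors as first-class” advice above: distinguish retryable from fatal errors with explicit types, bound the retries, re-raise what cannot be recovered, and route exhausted work to a dead-letter queue rather than dropping it. The exception names and process_record are hypothetical.

    import time

    class RetryableError(Exception):
        """Transient failure (timeout, throttle); safe to retry."""

    class FatalError(Exception):
        """Bug or bad input; retrying will not help."""

    def process_record(record):
        # Hypothetical business logic that may raise either error type.
        raise RetryableError("downstream timeout")

    def handle(record, dead_letter_queue, max_attempts=3, backoff_s=0.1):
        for attempt in range(1, max_attempts + 1):
            try:
                return process_record(record)
            except RetryableError as err:
                if attempt == max_attempts:
                    # Recovery semantics: park the work where it can be
                    # replayed instead of silently discarding the failure.
                    dead_letter_queue.append((record, repr(err)))
                    return None
                time.sleep(backoff_s * attempt)
            except FatalError:
                # Do not mask fatal errors as "no result": surface them loudly.
                raise

    if __name__ == "__main__":
        dlq = []
        handle({"id": 1}, dlq)
        print("dead-lettered:", dlq)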

Accessibility, Tooling, and Adoption

  • Some readers want simple “hello world” examples of TLA+ or P in the article; otherwise the techniques feel like heavy overhead vs. “good design and testing.” Others reply that testing can never cover the state space these tools can.
  • AI tools (e.g., large language models) are reported to help generate TLA+ specs from existing code and find real bugs, sparking speculation that AI could greatly accelerate rigorous testing and formal methods.

P, Coyote, and Control Planes

  • P and its successors (P#, Coyote) are discussed as being used both for abstract model checking and for testing real production services, especially state-machine-based control planes.
  • Some question whether generating production code from P is still done; current emphasis seems more on testing than codegen.
  • There’s criticism that building P/Coyote on C# reduces approachability compared to Go/Java ecosystems, although the underlying goal—making formal methods more usable—is applauded.

S3, GCS, and Engineering Impressiveness

  • S3’s long history, near-flawless operation at enormous scale, and migration to global strong read-after-write consistency are widely admired.
  • Some argue Google Cloud Storage had strong consistency earlier and is more “cleanly” engineered; others counter that S3’s scale, age, and compatibility make its evolution more impressive.

Contrasting with Other Practices

  • There is frustration that formal methods are often dismissed in industry, while practices like TDD (criticized here as lacking formal grounding and sometimes quasi-religious) gained wide adoption.
  • Property-based testing and fuzzing are generally accepted as “semi-formal”; runtime monitoring is more contentious, seen as semi-formal only when it explicitly checks specified temporal properties.
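
At the “semi-formal” end, a minimal property-based test in Python using the Hypothesis library (a tool choice assumed here, not one named in the discussion): rather than hand-picked cases, the framework searches generated inputs for a counterexample to a stated property and shrinks any failure it finds. The run-length codec and its round-trip property are hypothetical examples.

    from hypothesis import given, strategies as st

    def rle_encode(s):
        """Hypothetical run-length encoder: 'aaab' -> [('a', 3), ('b', 1)]."""
        runs = []
        for ch in s:
            if runs and runs[-1][0] == ch:
                runs[-1] = (ch, runs[-1][1] + 1)
            else:
                runs.append((ch, 1))
        return runs

    def rle_decode(runs):
        return "".join(ch * n for ch, n in runs)

    # The property acts as a lightweight specification: decoding an encoding
    # must reproduce the original string for any generated input, not just
    # the handful of examples a unit test would list.
    @given(st.text())
    def test_round_trip(s):
        assert rle_decode(rle_encode(s)) == s

    if __name__ == "__main__":
        test_round_trip()  # Hypothesis runs many generated cases and shrinks failures
        print("round-trip property held on generated inputs")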