Systems Correctness Practices at Amazon Web Services
Use of TLA+ and Formal Methods
- Several commenters describe practical wins from TLA+ beyond “big distributed systems”: games (Snake), queues, embedded systems, and modeling misbehaving hardware.
- Key idea: model the system as state transitions with invariants (e.g., “snake length ≥ 2”). Model checking then explores executions to find violating traces that are very unlikely to appear in tests (see the sketch after this list).
- Some clarify that TLA+ can also be used as a proof system (via TLAPS), not just for model checking, and that proofs cover all behaviors, including infinite ones.
- Others stress limits: you only verify what you specify; there are gaps between model and implementation and between real needs and the properties you think to state.
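A rough illustration of the state-transition-plus-invariant idea, in Python rather than TLA+ (the toy snake model and its planted bug are invented for this sketch, not taken from the article): breadth-first search over the reachable states plays the role of TLC and returns the shortest action sequence that breaks the invariant.

```python
"""Toy explicit-state model checking: states, transitions, an invariant,
and a search that reports a counterexample trace when the invariant fails."""
from collections import deque

def initial_state():
    return (2, 0)  # (snake length, pending growth)

def next_states(state):
    length, pending = state
    # Bound the model so the reachable state space stays finite, as TLC requires.
    yield ("eat", (length, min(pending + 1, 3)))
    if pending:
        yield ("advance", (min(length + 1, 8), pending - 1))
    else:
        yield ("advance", (length, 0))
    if length >= 3:
        yield ("trim", (length - 2, pending))  # BUG: drops two segments at once

def invariant(state):
    return state[0] >= 2  # snake length must never fall below 2

def check():
    init = initial_state()
    parent = {init: None}            # state -> (predecessor, action) for trace rebuild
    frontier = deque([init])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, action = parent[state]
                trace.append(action)
            return list(reversed(trace))
        for action, succ in next_states(state):
            if succ not in parent:
                parent[succ] = (state, action)
                frontier.append(succ)
    return None  # no reachable state violates the invariant

if __name__ == "__main__":
    print("violating trace:", check())  # e.g. ['eat', 'advance', 'trim']
```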
Deterministic Simulation Testing (DST)
- Deterministic simulation of distributed systems is widely praised as “amazing.” AWS’s approach—single-threaded simulation controlling scheduling, timing, and network—is compared to Loom (Rust), TigerBeetle’s simulator, FoundationDB, Antithesis, Keploy, Coyote (.NET), Java tools, Haskell’s Dejafu, rr, and past projects like Hermit and Corensic.
- There is debate about feasibility: exhaustively exploring all orderings is impossible for nontrivial systems, but capturing and replaying specific bad orderings is highly valuable (a seeded-replay sketch follows this list).
- Retrofitting determinism onto arbitrary software is seen as hard; tight coupling to frameworks or runtimes has historically limited adoption.
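A minimal sketch of the replay idea, assuming nothing about AWS’s or TigerBeetle’s actual harnesses: all nondeterminism (here, message delivery order) is drawn from a single seeded PRNG inside a single-threaded event loop, so any schedule that breaks an invariant can be reproduced exactly from its seed.

```python
"""Deterministic simulation sketch: the only source of nondeterminism is a
seeded PRNG, so a failing schedule can be replayed exactly from its seed."""
import random

def run(seed: int) -> bool:
    rng = random.Random(seed)
    network = []                      # in-flight messages
    leader_log = []
    follower_state = None

    # Leader "sends" three versions of a value; the buggy follower applies
    # whatever arrives, with no version check.
    for version in (1, 2, 3):
        leader_log.append(version)
        network.append(version)

    # Single-threaded event loop: the PRNG picks which message to deliver next,
    # standing in for a real scheduler controlling threads, timers, and the network.
    while network:
        msg = network.pop(rng.randrange(len(network)))
        follower_state = msg          # BUG: stale messages overwrite newer state

    return follower_state == leader_log[-1]   # invariant: follower converges to latest

if __name__ == "__main__":
    for seed in range(1000):
        if not run(seed):
            print(f"invariant violated with seed {seed}; replay with run({seed})")
            break
```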
Error Handling and Catastrophic Failures
- The “92% of catastrophic failures from mishandled nonfatal errors” statistic strongly resonates. Many argue that error paths get far less design and testing attention than happy paths.
- Best practices discussed: treat errors as first-class, use precise error types/status codes, design for recovery semantics (retries, dead-letter queues, fallbacks), and avoid turning fatal errors into silent nulls (a minimal sketch follows this list).
- Distributed systems complicate “just crash”: crashes can cause restart loops or inconsistent state unless failure handling is carefully modeled.
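A small sketch of the error-handling style the thread argues for, with invented names (`RetryableError`, `handle`, `dead_letter`): errors are typed, retries are bounded and explicit, and exhausted or fatal failures go to a dead-letter path instead of becoming silent nulls.

```python
"""Error paths as first-class design: typed errors, bounded retries,
and a dead-letter path instead of silently dropping failures."""
import time

class RetryableError(Exception):
    """Transient failure (timeout, throttle); safe to retry."""

class FatalError(Exception):
    """Bug or bad input; retrying cannot help."""

def handle(message, process, dead_letter, max_attempts=3, backoff_s=0.1):
    for attempt in range(1, max_attempts + 1):
        try:
            return process(message)
        except RetryableError:
            if attempt == max_attempts:
                dead_letter(message, reason="retries exhausted")
                return None
            time.sleep(backoff_s * 2 ** (attempt - 1))   # exponential backoff
        except FatalError as err:
            # Do not convert a fatal error into a silent None for the caller;
            # park the message for inspection and surface the failure.
            dead_letter(message, reason=str(err))
            raise
```

In a real service the `dead_letter` callback would typically enqueue the message on a dead-letter queue so failed work can be inspected and replayed rather than lost.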
Accessibility, Tooling, and Adoption
- Some readers want simple “hello world” examples of TLA+ or P in the article; without them the techniques feel like heavy overhead compared with “good design and testing.” Others reply that testing alone can never cover the state space these tools explore.
- AI tools (e.g., large language models) are reported to help generate TLA+ specs from existing code and to find real bugs, sparking speculation that AI could greatly accelerate rigorous testing and formal methods.
P, Coyote, and Control Planes
- P and its successors (P#, Coyote) are discussed as being used both for abstract model checking and for testing real production services, especially state-machine-based control planes (see the exploration sketch after this list).
- Some question whether generating production code from P is still done; current emphasis seems more on testing than codegen.
- There’s criticism that building P/Coyote on C# reduces approachability compared to Go/Java ecosystems, although the underlying goal—making formal methods more usable—is applauded.
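Coyote systematically explores interleavings of C# tasks; as a loose Python analogue (the `Provisioner` machine and its events are invented here), one can enumerate every ordering of a control plane’s pending events and check an invariant after each, which is enough to expose ordering bugs that a single happy-path test would miss.

```python
"""Controlled exploration of a state-machine control plane: try every
ordering of the pending events and check an invariant after each."""
from itertools import permutations

class Provisioner:
    def __init__(self):
        self.state = "CREATING"

    def handle(self, event):
        if event == "CREATE_COMPLETE" and self.state == "CREATING":
            self.state = "ACTIVE"
        elif event == "DELETE_REQUESTED" and self.state == "ACTIVE":
            self.state = "DELETED"
        # BUG: DELETE_REQUESTED while still CREATING is silently dropped,
        # so a later CREATE_COMPLETE leaves an orphaned ACTIVE resource.

def test_all_orderings():
    events = ["CREATE_COMPLETE", "DELETE_REQUESTED"]
    for order in permutations(events):
        machine = Provisioner()
        for event in order:
            machine.handle(event)
        # Invariant: once a delete was requested, the resource must end up deleted.
        assert machine.state == "DELETED", f"orphaned resource under ordering {order}"

if __name__ == "__main__":
    test_all_orderings()   # fails on ('DELETE_REQUESTED', 'CREATE_COMPLETE')
```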
S3, GCS, and Engineering Impressiveness
- S3’s long history, near-flawless operation at enormous scale, and migration to global strong read-after-write consistency are widely admired.
- Some argue Google Cloud Storage had strong consistency earlier and is more “cleanly” engineered; others counter that S3’s scale, age, and compatibility make its evolution more impressive.
Contrasting with Other Practices
- There is frustration that formal methods are often dismissed in industry, while practices like TDD (criticized here as lacking formal grounding and sometimes quasi-religious) gained wide adoption.
- Property-based testing and fuzzing are generally accepted as “semi-formal”; runtime monitoring is more contentious, seen as semi-formal only when it explicitly checks specified temporal properties (a property-based testing sketch follows).
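For readers unfamiliar with the “semi-formal” end of the spectrum, here is a minimal property-based test using the Hypothesis library (the run-length encoder is a stand-in example, not from the thread): rather than enumerating cases by hand, the test states a property and lets the framework search the input space for counterexamples. Runtime monitoring would analogously assert a specified temporal property while the system executes.

```python
"""Property-based testing: state a property over all inputs and let the
framework generate and shrink counterexamples."""
from hypothesis import given, strategies as st

def encode(items):
    """Run-length encode a list, e.g. [1, 1, 2] -> [(1, 2), (2, 1)]."""
    out = []
    for item in items:
        if out and out[-1][0] == item:
            out[-1] = (item, out[-1][1] + 1)
        else:
            out.append((item, 1))
    return out

def decode(pairs):
    return [item for item, count in pairs for _ in range(count)]

@given(st.lists(st.integers()))
def test_roundtrip(items):
    # Property: decoding an encoding returns the original list, for any input.
    assert decode(encode(items)) == items
```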