Jepsen: NATS 2.12.1
Initial reactions and related resources
- Some readers initially misread “Jepsen NATS” as aviation-related; others link to a recent Jepsen/Antithesis distributed-systems glossary as useful background.
Fsync, durability, and performance tradeoffs
- The main focus is “lazy fsync”: by default, NATS JetStream acknowledges writes immediately but only fsyncs them to disk every two minutes, so recently acknowledged data may not yet be on stable storage.
- Many see this as benchmark-driven and dangerous; a recurring view is that systems should default to safe durability and let users explicitly opt into “fast but risky.”
- Others argue many workloads don’t need strict durability and that batching fsyncs for throughput is normal in filesystems and databases.
- Several comments describe batching/group-commit strategies (similar to Postgres, Cassandra, etc.) that can preserve both safety and throughput, criticizing a fixed multi-minute timer as extreme.
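The group-commit idea raised in the comments above can be sketched as follows. This is a minimal illustration, not how NATS, Postgres, or Cassandra implement it: the class name `GroupCommitLog`, the `max_batch` parameter, and the newline-delimited file format are all assumptions made for the example. The point is that a writer is acknowledged only after an fsync covers its record, while concurrent writers share a single fsync.

```python
import os
import queue
import tempfile
import threading


def _write_all(fd, data):
    """Write every byte of data, looping in case os.write is partial."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


class GroupCommitLog:
    """Hypothetical append-only log: ack only after fsync, one fsync per batch."""

    def __init__(self, path, max_batch=64):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._queue = queue.Queue()
        self._max_batch = max_batch
        self.fsync_calls = 0  # updated only by the flusher thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, payload):
        """Block the caller until the payload has been fsynced."""
        done = threading.Event()
        self._queue.put((payload, done))
        done.wait()

    def close(self):
        """Stop the flusher; appends issued after close() are not supported."""
        self._queue.put(None)
        self._thread.join()
        os.close(self._fd)

    def _run(self):
        while True:
            item = self._queue.get()  # wait for the first pending write
            if item is None:
                return
            batch, stop = [item], False
            # Opportunistically collect whatever else is already queued.
            while len(batch) < self._max_batch:
                try:
                    nxt = self._queue.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            _write_all(self._fd, b"".join(p + b"\n" for p, _ in batch))
            os.fsync(self._fd)  # one fsync covers the whole batch
            self.fsync_calls += 1
            for _, done in batch:
                done.set()  # acknowledge only after the data is synced
            if stop:
                return


def demo(n=50):
    """Run n concurrent writers; return the logged lines and the fsync count."""
    path = os.path.join(tempfile.mkdtemp(), "group_commit.log")
    log = GroupCommitLog(path)
    writers = [
        threading.Thread(target=log.append, args=(f"m{i}".encode(),))
        for i in range(n)
    ]
    for w in writers:
        w.start()
    for w in writers:
        w.join()
    log.close()
    with open(path, "rb") as f:
        return f.read().splitlines(), log.fsync_calls
```

Calling `demo()` returns every record that was acknowledged along with the number of fsyncs issued, which is at most the number of writes and usually smaller under concurrency. Unlike a fixed multi-minute timer, the batch here is bounded by whatever is already waiting, so no acknowledged write sits unsynced.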
NATS JetStream behavior and Jepsen findings
- Commenters highlight Jepsen results: acknowledged messages can be lost, single-bit corruption can cause large data loss, snapshot corruption can cascade into stream deletion, and split-brain scenarios can persist.
- Many are surprised at how fragile JetStream is to simple corruption and membership changes, especially given marketing claims of durability and “store and replay.”
- Some note that NATS core is explicitly best-effort/ephemeral, but JetStream is promoted as persistent; mixing those mental models is seen as dangerous.
Comparisons with other systems and “safe defaults”
- Comparisons to early MongoDB and its durability tradeoffs recur.
- Discussion contrasts NATS with Kafka, Redis (including Redis Streams), MQTT, Postgres, SQLite, CockroachDB, FoundationDB, etc., focusing on when they acknowledge writes and what guarantees that implies.
- There is disagreement over how common “acknowledged-but-not-durable” defaults are; some claim they are widespread, others say they are not acceptable for a system marketed as durable.
Theory vs pragmatism and ecosystem responses
- The thread debates “overcomplicated theory” vs hacker pragmatism: some argue that ignoring distributed-systems theory repeatedly leads to disastrous bugs; others warn against perfectionism blocking value.
- NATS project responses on GitHub are critiqued as underestimating real failure modes.
- A few suggest alternatives (Kafka/Redpanda, Redis, custom builds, s2.dev) and praise Jepsen’s role in independent verification.