Jepsen: NATS 2.12.1

Initial reactions and related resources

  • Some readers initially misread “Jepsen NATS” as aviation-related; others link to a recent Jepsen/Antithesis distributed-systems glossary as useful background.

Fsync, durability, and performance tradeoffs

  • Major focus on “lazy fsync”: by default, NATS JetStream fsyncs to disk only every two minutes while acknowledging writes immediately, so an acknowledged write can remain unsynced for up to two minutes.
  • Many see this as benchmark-driven and dangerous; a recurring view is that systems should default to safe durability and let users explicitly opt into “fast but risky.”
  • Others argue many workloads don’t need strict durability and that batching fsyncs for throughput is normal in filesystems and databases.
  • Several comments describe batching/group-commit strategies (similar to Postgres, Cassandra, etc.) that can preserve both safety and throughput, criticizing a fixed multi-minute timer as extreme.
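
The group-commit idea in the last point can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how NATS, Postgres, or Cassandra implement it: the names GroupCommitLog, append, and demo are hypothetical, the "log" is a single append-only file, and error handling is left out. Each writer blocks until its record has been fsynced, but concurrent writers share one fsync per batch, so the acknowledgment stays honest without paying one disk flush per message.

```python
import os
import tempfile
import threading


class GroupCommitLog:
    """Hypothetical append-only log: a write is acknowledged only after fsync."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._cond = threading.Condition()
        self._pending = []        # (record bytes, Event to signal once durable)
        self._closed = False
        self.fsync_count = 0      # exposed so the batching effect is observable
        self._flusher = threading.Thread(target=self._run, daemon=True)
        self._flusher.start()

    def append(self, record: bytes) -> None:
        """Queue a record and block until the batch containing it is fsynced."""
        done = threading.Event()
        with self._cond:
            if self._closed:
                raise RuntimeError("log is closed")
            self._pending.append((record, done))
            self._cond.notify()
        done.wait()               # returning from here is the "ack"

    def _run(self) -> None:
        while True:
            with self._cond:
                while not self._pending and not self._closed:
                    self._cond.wait()
                if not self._pending:
                    return        # closed and fully drained
                batch, self._pending = self._pending, []
            # Everything queued while the previous fsync ran goes out together.
            view = memoryview(b"".join(rec for rec, _ in batch))
            while len(view) > 0:
                written = os.write(self._fd, view)
                view = view[written:]
            os.fsync(self._fd)    # one fsync covers the whole batch
            self.fsync_count += 1
            for _, done in batch:
                done.set()
            # Sketch only: a write/fsync error here would leave writers blocked.

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._flusher.join()
        os.close(self._fd)


def demo(writers: int = 20):
    """Run concurrent writers; return the persisted lines and the fsync count."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        log = GroupCommitLog(path)
        threads = [
            threading.Thread(target=log.append, args=(f"msg-{i}\n".encode(),))
            for i in range(writers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        log.close()
        with open(path, "rb") as f:
            lines = f.read().splitlines()
    return lines, log.fsync_count
```

Calling demo() returns every record that was acknowledged along with the number of fsync calls made, which is at most the number of writers and is often lower under contention. The contrast with a fixed timer is that batch size here adapts to load: an idle system syncs each write promptly, while a busy one amortizes the sync cost, and in neither case is a write acknowledged before it is on disk.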

NATS JetStream behavior and Jepsen findings

  • Commenters highlight Jepsen results: acknowledged messages can be lost, single-bit corruption can cause large data loss, snapshot corruption can cascade into stream deletion, and split-brain scenarios can persist.
  • Many are surprised at how fragile JetStream is to simple corruption and membership changes, especially given marketing claims of durability and “store and replay.”
  • Some note that NATS core is explicitly best-effort/ephemeral, but JetStream is promoted as persistent; mixing those mental models is seen as dangerous.

Comparisons with other systems and “safe defaults”

  • Comparisons to early MongoDB and its durability tradeoffs recur.
  • Discussion contrasts NATS with Kafka, Redis (including Redis Streams), MQTT, Postgres, SQLite, CockroachDB, FoundationDB, etc., focusing on when they acknowledge writes and what guarantees that implies.
  • There is disagreement over how common “acknowledged-but-not-durable” defaults are; some claim it’s widespread, others say it’s not acceptable for a system marketing durability.

Theory vs pragmatism and ecosystem responses

  • The thread debates “overcomplicated theory” vs. hacker pragmatism: some argue that ignoring distributed-systems theory repeatedly leads to disastrous bugs; others warn against letting perfectionism block the delivery of value.
  • Commenters critique the NATS project's responses on GitHub as underestimating real failure modes.
  • A few suggest alternatives (Kafka/Redpanda, Redis, custom builds, s2.dev) and praise Jepsen’s role in independent verification.