2025-12-08

Jepsen: NATS 2.12.1

Initial reactions and related resources

Some readers initially misread “Jepsen NATS” as aviation-related; others link to a recent Jepsen/Antithesis distributed-systems glossary as useful background.

Fsync, durability, and performance tradeoffs

A major focus is “lazy fsync”: by default, NATS JetStream flushes to disk only every two minutes while acknowledging writes immediately.
Many see this as benchmark-driven and dangerous; a recurring view is that systems should default to safe durability and let users explicitly opt into “fast but risky.”
Others argue many workloads don’t need strict durability and that batching fsyncs for throughput is normal in filesystems and databases.
Several comments describe batching/group-commit strategies (similar to Postgres, Cassandra, etc.) that can preserve both safety and throughput, criticizing a fixed multi-minute timer as extreme.
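To make the group-commit idea concrete, here is a minimal single-threaded sketch in Python. It is illustrative only: it is not how NATS, Postgres, or Cassandra implement it, and the names GroupCommitLog, append, commit, and acked are hypothetical. Several pending writes share one fsync, and none of them counts as acknowledged until that fsync returns.

```python
import os


class GroupCommitLog:
    """Toy append-only log: a write is acknowledged only after an fsync covers it.

    Hypothetical illustration; not taken from any real system's code.
    """

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._pending = []       # (ticket, payload) pairs not yet durable
        self._next_ticket = 0
        self.acked = []          # tickets whose data has been fsynced
        self.fsync_count = 0     # number of fsync calls issued so far

    def append(self, payload):
        """Queue a record. The caller must NOT treat this as an acknowledgement."""
        ticket = self._next_ticket
        self._next_ticket += 1
        self._pending.append((ticket, payload))
        return ticket

    def commit(self):
        """Write every queued record, fsync once, then acknowledge the whole batch."""
        if not self._pending:
            return []
        batch, self._pending = self._pending, []
        data = memoryview(b"".join(payload + b"\n" for _, payload in batch))
        while len(data) > 0:     # os.write may write fewer bytes than requested
            written = os.write(self._fd, data)
            data = data[written:]
        os.fsync(self._fd)       # one fsync covers the entire batch
        self.fsync_count += 1
        tickets = [ticket for ticket, _ in batch]
        self.acked.extend(tickets)
        return tickets

    def close(self):
        os.close(self._fd)
```

In a real server, commit would be triggered by a short timer or a batch-size threshold (milliseconds rather than minutes), so the fsync cost is amortized across concurrent writers without acknowledging anything that is not yet on disk.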

NATS JetStream behavior and Jepsen findings

Commenters highlight Jepsen results: acknowledged messages can be lost, single-bit corruption can cause large data loss, snapshot corruption can cascade into stream deletion, and split-brain scenarios can persist.
Many are surprised at how fragile JetStream is in the face of simple corruption and membership changes, especially given marketing claims of durability and “store and replay.”
Some note that NATS core is explicitly best-effort/ephemeral, but JetStream is promoted as persistent; mixing those mental models is seen as dangerous.

Comparisons with other systems and “safe defaults”

Comparisons to early MongoDB and its durability tradeoffs recur.
Discussion contrasts NATS with Kafka, Redis (including Redis Streams), MQTT, Postgres, SQLite, CockroachDB, FoundationDB, etc., focusing on when they acknowledge writes and what guarantees that implies.
There is disagreement over how common “acknowledged-but-not-durable” defaults are; some claim it’s widespread, others say it’s not acceptable for a system marketing durability.

Theory vs pragmatism and ecosystem responses

The thread debates “overcomplicated theory” versus hacker pragmatism: some argue that ignoring distributed-systems theory repeatedly leads to disastrous bugs; others warn against perfectionism blocking value.
NATS project responses on GitHub are critiqued as underestimating real failure modes.
A few suggest alternatives (Kafka/Redpanda, Redis, custom builds, s2.dev) and praise Jepsen’s role in independent verification.
