2024-09-30

No such thing as exactly-once delivery

Core distinction: “delivery” vs “processing”

Major thread theme: people conflate “message delivery” with “message processing / committing side effects.”
One camp insists “exactly-once delivery” is impossible in failure-prone distributed systems.
Another says you can get “exactly-once processing” via idempotency, deduplication, counters, and transactions, as long as you acknowledge this is different from transport-level delivery.
Side effects (emails, database writes, external APIs) are where guarantees usually break down.
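The "exactly-once processing over at-least-once delivery" position can be illustrated with a minimal sketch. Everything here is hypothetical: the `DedupingConsumer` class, the message IDs, and the in-memory `seen_ids` set (a real system would need durable storage, updated atomically with the side effect).

```python
class DedupingConsumer:
    """Applies each message's effect at most once, keyed by message ID."""

    def __init__(self):
        self.seen_ids = set()  # assumption: in-memory only; not crash-safe
        self.total = 0         # stand-in for a side effect (a running sum)

    def handle(self, msg_id, amount):
        if msg_id in self.seen_ids:
            return False       # duplicate delivery: skip the side effect
        self.seen_ids.add(msg_id)
        self.total += amount
        return True


# At-least-once delivery: "m2" arrives twice, e.g. after a lost acknowledgement.
deliveries = [("m1", 10), ("m2", 5), ("m2", 5), ("m3", 1)]
consumer = DedupingConsumer()
applied = [consumer.handle(msg_id, amount) for msg_id, amount in deliveries]
```

The transport still delivered "m2" twice; only the effect was applied once, which is the distinction the thread keeps returning to.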

Limits, failures, and probabilities

Several comments stress that even “at-least-once” cannot be guaranteed in finite time when nodes, networks, or power can fail arbitrarily or partitions persist.
Systems can only drive the probability of loss/duplication arbitrarily low, not to zero.
Commenters cite the Byzantine Generals problem and the CAP theorem to argue that global, time-bounded exactly-once is provably impossible under realistic assumptions.
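A small worked calculation shows why retries shrink the loss probability without eliminating it. It assumes, purely for illustration, that each send attempt fails independently with the same probability `p_fail`; real failures are often correlated, so this is an optimistic model. The function names are hypothetical.

```python
def loss_probability(p_fail, attempts):
    """Probability that all `attempts` independent sends fail."""
    return p_fail ** attempts


def attempts_needed(p_fail, target):
    """Smallest number of attempts whose loss probability is <= target."""
    n = 1
    while loss_probability(p_fail, n) > target:
        n += 1
    return n


# With a 50% failure rate per attempt, 17 attempts push loss below 1e-5,
# but the probability is still strictly greater than zero.
needed = attempts_needed(0.5, 1e-5)
residual = loss_probability(0.5, needed)
```

For any `p_fail` above zero and any finite number of attempts, the residual probability stays positive, and each extra attempt costs time, which is where the "finite time" caveat comes in.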

Examples: TCP, queues, email, HFT

TCP is described as:
- Within one connection: data never delivered twice by definition.
- From the app’s perspective: at-most-once, because data can be lost on failures.
Streaming systems (Kafka, Kinesis, Flink, Beam, Kafka Streams) use offsets/checkpoints to approximate exactly-once processing over at-least-once delivery.
Email’s Message-Id is cited as an idempotency key for deduplication.
High-frequency trading example: strict latency budgets make even at-least-once impossible to guarantee.
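The offset/checkpoint idea mentioned for streaming systems can be sketched as follows. This is not how any of the named systems is implemented; it is a toy that stores the processing result and the next offset in the same SQLite transaction, so a replay after a restart resumes where the last commit left off. The table layout and function names are assumptions.

```python
import sqlite3


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE state (k TEXT PRIMARY KEY, v INTEGER)")
    db.execute("INSERT INTO state VALUES ('total', 0), ('next_offset', 0)")
    db.commit()
    return db


def process(db, log):
    """Consume `log` from the stored offset; commit result and offset together."""
    (offset,) = db.execute(
        "SELECT v FROM state WHERE k = 'next_offset'").fetchone()
    for i in range(offset, len(log)):
        with db:  # one transaction: both updates commit, or neither does
            db.execute("UPDATE state SET v = v + ? WHERE k = 'total'", (log[i],))
            db.execute("UPDATE state SET v = ? WHERE k = 'next_offset'", (i + 1,))


def total(db):
    return db.execute("SELECT v FROM state WHERE k = 'total'").fetchone()[0]


log = [10, 5, 1]
db = open_store()
process(db, log)
process(db, log)  # simulated redelivery of the whole log: nothing is reapplied
```

This only works because the result and the offset live inside one transactional store; once the effect is an email or an external API call, the offset can no longer be committed atomically with it.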

Idempotency, transactions, and system boundaries

Repeated point: you can build reliable, transactional behavior on unreliable components, but you pay with complexity and cross-layer logic.
Exactly-once processing is achievable inside a transactional boundary; crossing boundaries requires idempotency keys and careful coordination.
Chaining two “exactly-once” subsystems via a stateless middle still requires end-to-end idempotency.
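Crossing a boundary with an idempotency key might look like the sketch below. `PaymentService` and `charge_with_retry` are hypothetical names, and the lost reply is simulated; the point is only that the caller generates the key once and reuses it on every retry, so the downstream side can recognise a replay.

```python
import uuid


class PaymentService:
    """Hypothetical downstream service that stores results by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency key -> stored result
        self.charges = []  # side effects actually performed

    def charge(self, key, amount):
        if key in self.results:
            return self.results[key]  # replay: return the earlier result
        self.charges.append(amount)
        self.results[key] = f"charged {amount}"
        return self.results[key]


def charge_with_retry(service, amount, lose_first_reply=True):
    key = str(uuid.uuid4())  # generated once, reused on every retry
    for attempt in range(3):
        reply = service.charge(key, amount)
        if attempt == 0 and lose_first_reply:
            continue  # simulate a reply lost in transit; caller retries
        return reply
    raise RuntimeError("no reply after retries")


service = PaymentService()
reply = charge_with_retry(service, 25)
```

The caller sent the request twice, yet only one charge was recorded. If a stateless middle layer generated a fresh key on each retry, the downstream deduplication would not help, which is why the key has to travel end to end.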

Filesystem and low-level guarantees

Debate over whether file renames across directories are truly atomic and durable across crashes.
Distinction between POSIX-level atomicity from a process’s view and on-disk reality under crashes or in distributed filesystems.
Conclusion: even with “atomic” primitives, crash timing can still reintroduce duplicates or ambiguity.
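For reference, the commonly used write-then-rename pattern looks roughly like this on a POSIX system. It is a sketch under assumptions: the temporary file sits in the same directory as the target (so this does not cover the cross-directory case debated above), the helper name `durable_write` is hypothetical, and whether the `fsync` calls actually reach stable storage depends on the filesystem and hardware.

```python
import os
import tempfile


def durable_write(path, data):
    """Write `data` to `path` via a temp file and an atomic rename (POSIX)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # push file contents to disk before renaming
        os.replace(tmp_path, path)  # atomic from other processes' point of view
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # clean up the temp file on failure
        raise
    # Sync the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Even with this pattern, a crash between the rename and whatever records "this write is done" leaves the caller unsure whether to redo the work, which is the ambiguity the thread points to.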

Semantics, marketing, and practice

Several comments criticize vendors who advertise “exactly-once delivery,” arguing it’s really “exactly-once for practical purposes” or “inside our processing model.”
Some argue that if a higher layer only ever sees each message once, that’s effectively exactly-once; others insist terminology must reflect theoretical limits.
Anecdotes show real systems often have much higher duplicate rates than expected, and many apps assume exactly-once without monitoring or checks.
