2025-04-25

What If We Could Rebuild Kafka from Scratch?

Object-storage Kafka and Warpstream-style designs

Commenters discuss Warpstream and similar S3-backed approaches: some see them as “good enough” that Confluent preferred acquiring one over building its own.
Others argue Confluent simply lacked an S3-backed story and Warpstream had drawbacks, notably higher latency that can turn into cost.
Several comments explain the economic driver: cross-AZ traffic between EC2 instances can be pricier than pushing data through object storage, making S3-backed Kafka cheaper (especially on AWS), plus easier scaling and multi-region active-active setups.
Skeptics note this is highly AWS-pricing-driven; where cross-AZ is cheap, the cost advantages may disappear, while latency and complexity remain.
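The cost argument can be made concrete with a back-of-the-envelope sketch. Every price below is an illustrative assumption (roughly in line with published AWS list prices, but not taken from the discussion), and the model ignores storage, compute, and client-side traffic. The function names are hypothetical.

```python
# All prices are assumptions for illustration only.
CROSS_AZ_PER_GB = 0.02     # assumed: about $0.01/GB billed on each side of a cross-AZ transfer
PUT_PER_1000 = 0.005       # assumed price per 1,000 object-storage PUT requests
GET_PER_1000 = 0.0004      # assumed price per 1,000 object-storage GET requests


def replicated_cost(gb_written, replication_factor=3, cross_az_per_gb=CROSS_AZ_PER_GB):
    """Cross-AZ transfer cost when every byte is copied to followers in other AZs."""
    return gb_written * (replication_factor - 1) * cross_az_per_gb


def object_store_cost(gb_written, object_mb=8.0, reads_per_object=3):
    """Request cost when writes are batched into objects.

    Assumes transfer between compute and same-region object storage is free.
    """
    objects = gb_written * 1024 / object_mb
    puts = objects / 1000 * PUT_PER_1000
    gets = objects * reads_per_object / 1000 * GET_PER_1000
    return puts + gets


if __name__ == "__main__":
    gb = 1000
    print("broker replication:", replicated_cost(gb))
    print("object storage:    ", object_store_cost(gb))
    # The skeptics' case: where cross-AZ traffic is free, the gap vanishes.
    print("free cross-AZ:     ", replicated_cost(gb, cross_az_per_gb=0.0))
```

Under these assumptions the transfer bill dominates the request bill, which matches the economic driver described above. Setting the cross-AZ price to zero removes the advantage, which is the skeptics' point.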

Kafka complexity, UX, and operations

Many describe a common experience: the idea looks simple (“append-only, scalable log”), but the reality is complex. Partitions, cluster management, replication, upgrades, and recovery are all painful.
Critiques focus on poor developer UX: confusing defaults, a weak schema story, and difficult testing (many want a simple in-memory Kafka). Several test harnesses are mentioned, but they are ecosystem add-ons, not part of core Kafka.
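For a sense of what a “simple in-memory Kafka” might look like in a unit test, here is a minimal sketch. `InMemoryLog` is a hypothetical class, not part of Kafka or of any harness mentioned in the thread; it models only append and offset-based reads, with no partitions, consumer groups, or persistence.

```python
class InMemoryLog:
    """Hypothetical test double: each topic is a list, and an offset is a list index."""

    def __init__(self):
        self._topics = {}

    def append(self, topic, key, value):
        """Append a record and return the offset it was written at."""
        records = self._topics.setdefault(topic, [])
        records.append((key, value))
        return len(records) - 1

    def read(self, topic, offset=0, max_records=100):
        """Return up to max_records records starting at offset; unknown topics read as empty."""
        return self._topics.get(topic, [])[offset:offset + max_records]


if __name__ == "__main__":
    log = InMemoryLog()
    log.append("orders", "k1", "created")
    log.append("orders", "k1", "paid")
    print(log.read("orders", offset=1))
```

The gap between this sketch and a real broker is essentially the complexity the commenters describe.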
Operationally, troubleshooting pathological behavior or cluster failures is seen as hard; some report extreme cases where Kafka instability contributed to a product line being shut down.
Others counter that with managed services Kafka “just works” and has been trouble-free for years.

Misuse and unclear scope

Some argue Kafka “doesn’t know what it wants to be” and, like k8s/systemd, tries to “eat the world,” accumulating complexity.
Kafka is reportedly used as a user database or KV store, or requested “because everyone else uses it” with no clear use case; commenters see all of these as misuse.
Defenders say Kafka is fundamentally “just” a distributed log; complexity stems from broad ambitions like being an “operating system for data systems.”

Alternatives and ecosystem lock-in

Suggested substitutes: RabbitMQ, NATS (+JetStream), Redis Streams, Pulsar, Redpanda, AutoMQ, cloud services (SQS/SNS, Kinesis), OSS Rust-based Fluvio, and vendor offerings.
NATS/JetStream and Redis Streams are praised as simpler and lighter; however, NATS’ marketing, docs, and recent licensing drama are criticized.
Redpanda is liked for being Kafka-compatible, faster, and JVM-free, but its non-Apache licensing is noted.
Pulsar is seen as addressing some Kafka issues but introducing others; its weaker ecosystem and “nobody gets fired for picking Kafka/Confluent” dynamics slow adoption.
Multiple comments emphasize network effects: even a 10–30% better system struggles versus Kafka’s tooling, docs, and operator expertise.

Queues vs databases and consistency semantics

Some argue that for “read your own writes” semantics and derived views, it’s simpler to write directly to a database instead of Kafka.
Others respond that queues exist to handle retries, backpressure, and spikes (e.g., notifying millions of users at 9am) without overloading a DB, and to decouple unknown downstream consumers.
There’s debate over whether adding a queue inherently improves reliability, or just adds more moving parts and failure modes; several stress you must be clear why a queue is needed.
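The spike-absorbing argument can be illustrated with a small tick-based simulation. This is a sketch under stated assumptions: `drain_burst` and its parameters are hypothetical, the database serves a fixed number of requests per tick, and rejected messages stand in for backpressure on the producer.

```python
def drain_burst(arrivals, db_capacity, max_depth):
    """Simulate a bounded queue in front of a database.

    arrivals: messages arriving at each tick.
    db_capacity: messages the database can absorb per tick.
    max_depth: queue bound; arrivals beyond it are rejected (backpressure).
    Returns (peak_db_load, rejected, ticks_until_drained).
    """
    arrivals = list(arrivals)
    depth = rejected = peak = ticks = 0
    i = 0
    while i < len(arrivals) or depth > 0:
        incoming = arrivals[i] if i < len(arrivals) else 0
        i += 1
        accepted = min(incoming, max_depth - depth)
        rejected += incoming - accepted
        depth += accepted
        served = min(db_capacity, depth)   # the database never sees more than its capacity
        depth -= served
        peak = max(peak, served)
        ticks += 1
    return peak, rejected, ticks


if __name__ == "__main__":
    # A burst of 1000 notifications against a database that handles 100 per tick.
    print(drain_burst([1000, 0, 0], db_capacity=100, max_depth=10_000))
    # A tighter bound pushes the excess back onto the producer.
    print(drain_burst([1000, 0, 0], db_capacity=100, max_depth=500))
```

The queue caps the load the database sees, at the price of added latency and one more component that can fail, which is the trade-off the thread debates.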

Ordering, partitions, and causality

Many agree with the article’s critique of partitions: often you only care about ordering per key, yet partitions create head-of-line blocking and operational headaches.
The discussion explores alternatives such as per-key ordering (SQS FIFO group keys, parallel consumer libraries, Pulsar-style per-key acks), but notes a nasty worst case: arbitrary causal dependency graphs tend to induce O(n²) time/space costs unless you fundamentally change the storage/indexing model.
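Head-of-line blocking is easy to see in a toy model. The sketch below assumes all records arrive at time zero, each lane processes its records serially, and lanes run in parallel; `finish_times` and `lane_of` are hypothetical names, not an API of any system mentioned here.

```python
from collections import defaultdict


def finish_times(records, lane_of):
    """Return the finish time of each record, in input order.

    records: list of (key, processing_cost) in arrival order.
    lane_of: maps a key to its serial lane (a partition id, or the key itself).
    """
    clock = defaultdict(int)   # elapsed processing time per lane
    out = []
    for key, cost in records:
        lane = lane_of(key)
        clock[lane] += cost
        out.append(clock[lane])
    return out


if __name__ == "__main__":
    records = [("slow", 100), ("a", 1), ("b", 1)]
    # All three keys hash to the same partition: "a" and "b" wait behind "slow".
    print(finish_times(records, lane_of=lambda key: 0))
    # Ordering per key only: unrelated keys are not delayed.
    print(finish_times(records, lane_of=lambda key: key))
```

Per-key lanes keep the ordering guarantee most users actually want while removing the delay imposed by unrelated keys that share a partition.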
Some suggest that fully general causal ordering would require sorted indexes and topological sorting, likely pushing you into database-like architectures with O(n log n) behavior, sacrificing some sequential-IO advantages of Kafka.
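As a rough illustration of what general causal ordering involves, here is a sketch of a topological sort (Kahn's algorithm) that uses a heap to break ties by timestamp. It is not how any system in the thread works; `causal_order`, `events`, and `depends_on` are hypothetical names. The heap is the source of the logarithmic factor, and the need to track every dependency edge is what breaks the purely sequential append-and-scan model.

```python
import heapq


def causal_order(events, depends_on):
    """Order events so that every event follows the events it depends on.

    events: dict mapping event id -> timestamp (used only to break ties).
    depends_on: dict mapping event id -> list of ids that must come first.
    """
    indegree = {e: 0 for e in events}
    children = {e: [] for e in events}
    for event, parents in depends_on.items():
        for parent in parents:
            children[parent].append(event)
            indegree[event] += 1

    # Events with no unmet dependencies, ordered by timestamp.
    ready = [(events[e], e) for e in events if indegree[e] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, event = heapq.heappop(ready)
        order.append(event)
        for child in children[event]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, (events[child], child))

    if len(order) != len(events):
        raise ValueError("dependency cycle")
    return order


if __name__ == "__main__":
    # "b" has the earliest timestamp but must wait for "a".
    print(causal_order({"a": 3, "b": 1, "c": 2}, {"b": ["a"]}))
```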
There’s disagreement on whether hiding partitions behind a simpler abstraction (keys, hierarchical topics, multi-tenancy) is just renaming concepts versus a meaningful UX improvement.

Rebuilds, rewrites, and “Kafka from scratch”

LinkedIn’s C++ “Northguard” rewrite is mentioned as an example of rethinking Kafka, but its lack of protocol compatibility is seen as a major ecosystem break.
Several startups (Redpanda, Fluvio, AutoMQ, Warpstream) are effectively “new Kafka implementations” exploring S3-based storage, Rust/C++ rewrites, and different processing models.
Some participants are wary of ground-up rewrites on principle; they view the real constraint as Kafka’s entrenched ecosystem rather than pure technical design.
