What If We Could Rebuild Kafka from Scratch?

Object-storage Kafka and Warpstream-style designs

  • On Warpstream and similar S3-backed approaches: some see them as “good enough” that Confluent preferred to acquire rather than build.
  • Others argue Confluent simply lacked an S3-backed story and Warpstream had drawbacks, notably higher latency that can turn into cost.
  • Several comments explain the economic driver: cross-AZ traffic between EC2 instances can be pricier than pushing data through object storage, making S3-backed Kafka cheaper (especially on AWS), plus easier scaling and multi-region active-active setups.
  • Skeptics note this is highly AWS-pricing-driven; where cross-AZ is cheap, the cost advantages may disappear, while latency and complexity remain.
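
The cost argument above can be made concrete with a rough back-of-the-envelope model. This is only a sketch: the prices, the number of cross-AZ copies, the object size, and the read fan-out are all illustrative assumptions rather than figures from the discussion, and storage and compute costs are ignored.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    # All values are illustrative assumptions in USD; substitute your provider's real prices.
    cross_az_per_gb: float = 0.02    # assumed cost for 1 GB crossing an AZ boundary
    s3_put_per_1000: float = 0.005   # assumed cost per 1000 object-store PUT requests
    s3_get_per_1000: float = 0.0004  # assumed cost per 1000 object-store GET requests


def replicated_cost_per_gb(p: Prices, cross_az_copies: float = 2.0) -> float:
    """Network cost of broker-to-broker replication for 1 GB of produced data.

    cross_az_copies is the assumed number of times each byte crosses an AZ
    boundary (e.g. two follower replicas in other zones).
    """
    return cross_az_copies * p.cross_az_per_gb


def object_store_cost_per_gb(p: Prices, object_size_mb: float = 4.0,
                             reads_per_object: float = 3.0) -> float:
    """Request cost of writing 1 GB as batched objects and reading each one back."""
    objects_per_gb = 1024.0 / object_size_mb
    put_cost = objects_per_gb * p.s3_put_per_1000 / 1000.0
    get_cost = objects_per_gb * reads_per_object * p.s3_get_per_1000 / 1000.0
    return put_cost + get_cost


if __name__ == "__main__":
    aws_like = Prices()
    cheap_network = Prices(cross_az_per_gb=0.0)  # a provider that does not bill cross-AZ traffic
    for label, prices in [("aws-like", aws_like), ("free cross-AZ", cheap_network)]:
        print(label, replicated_cost_per_gb(prices), object_store_cost_per_gb(prices))
```

Under these assumed numbers the object-store path is cheaper per GB, and setting the cross-AZ price to zero reverses the result, which is the skeptics' point. The batching that keeps request counts low (larger objects) is also where the extra latency comes from.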

Kafka complexity, UX, and operations

  • Many describe a common experience: the idea looks simple (“append-only, scalable log”), but the reality is complex. Partitions, cluster management, replication, upgrades, and recovery are all painful.
  • Critiques focus on poor developer UX: confusing defaults, a weak schema story, and difficult testing (with a desire for a simple in-memory Kafka). Several test harnesses are mentioned, but they’re ecosystem add-ons, not core.
  • Operationally, troubleshooting pathological behavior or cluster failures is seen as hard; some report extreme cases where Kafka instability contributed to a product line being shut down.
  • Others counter that with managed services Kafka “just works” and has been trouble-free for years.

Misuse and unclear scope

  • Some argue Kafka “doesn’t know what it wants to be” and, like k8s/systemd, tries to “eat the world,” accumulating complexity.
  • Kafka is reportedly used as a user database or KV store, or requested “because everyone else uses it” with no clear use case; both are seen as misuse.
  • Defenders say Kafka is fundamentally “just” a distributed log; complexity stems from broad ambitions like being an “operating system for data systems.”

Alternatives and ecosystem lock-in

  • Suggested substitutes: RabbitMQ, NATS (+JetStream), Redis Streams, Pulsar, Redpanda, AutoMQ, cloud services (SQS/SNS, Kinesis), OSS Rust-based Fluvio, and vendor offerings.
  • NATS/JetStream and Redis Streams are praised as simpler and lighter; however, NATS’ marketing/docs and recent licensing drama are criticized.
  • Redpanda is liked for being Kafka-compatible, faster, and JVM-free, but its non-Apache licensing is noted.
  • Pulsar is seen as addressing some Kafka issues but introducing others; its weaker ecosystem and “nobody gets fired for picking Kafka/Confluent” dynamics slow adoption.
  • Multiple comments emphasize network effects: even a 10–30% better system struggles versus Kafka’s tooling, docs, and operator expertise.

Queues vs databases and consistency semantics

  • Some argue that for “read your own writes” semantics and derived views, it’s simpler to write directly to a database instead of Kafka.
  • Others respond that queues exist to handle retries, backpressure, and spikes (e.g., notifying millions of users at 9am) without overloading a DB, and to decouple unknown downstream consumers.
  • There’s debate over whether adding a queue inherently improves reliability, or just adds more moving parts and failure modes; several stress you must be clear why a queue is needed.
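
The spike-absorption argument can be illustrated with a toy simulation. It is a minimal sketch, not a model of any real broker: the function name `drain_spike`, the tick-based timing, and the fixed per-tick database capacity are all assumptions made for illustration.

```python
from collections import deque
from typing import List


def drain_spike(spike_size: int, db_capacity_per_tick: int) -> List[int]:
    """Buffer a burst of messages and feed the database at a bounded rate.

    Returns the number of writes the database receives on each tick.
    Without the queue, the database would see all spike_size writes at once.
    """
    if db_capacity_per_tick <= 0:
        raise ValueError("db_capacity_per_tick must be positive")

    queue = deque(range(spike_size))  # the whole burst arrives at tick 0
    per_tick_load = []
    while queue:
        batch = min(db_capacity_per_tick, len(queue))
        for _ in range(batch):
            queue.popleft()  # hand one message to the database
        per_tick_load.append(batch)
    return per_tick_load


if __name__ == "__main__":
    load = drain_spike(spike_size=10, db_capacity_per_tick=4)
    print("writes per tick:", load, "ticks to drain:", len(load))
```

The database load stays capped at its capacity, at the cost of added delay for the last messages in the burst. That delay is the trade-off behind the “read your own writes” objection.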

Ordering, partitions, and causality

  • Many agree with the article’s critique of partitions: often you only care about ordering per key, while partitions create head-of-line blocking and operational headaches.
  • Discussion explores alternatives like per-key ordering (SQS FIFO group keys, parallel consumer libraries, Pulsar-style per-key acks), but notes nasty worst-case complexity: arbitrary causal dependency graphs tend to induce O(n²) time/space costs unless you fundamentally change the storage/indexing model.
  • Some suggest that fully general causal ordering would require sorted indexes and topological sorting, likely pushing you into database-like architectures with O(n log n) behavior, sacrificing some sequential-IO advantages of Kafka.
  • There’s disagreement on whether hiding partitions behind a simpler abstraction (keys, hierarchical topics, multi-tenancy) is just renaming concepts versus a meaningful UX improvement.
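
The head-of-line blocking complaint can be shown with a small sketch. It assumes a single partition modeled as a list of (key, payload) tuples and one message that is stuck (for example, failing and being retried). The function `blocked_by` and the data are hypothetical and do not reflect any client library's API.

```python
from typing import List, Tuple

Message = Tuple[str, int]  # (key, payload); both are illustrative


def blocked_by(log: List[Message], stuck_index: int, per_key: bool) -> List[Message]:
    """Return the later messages that cannot be processed while log[stuck_index] is stuck.

    per_key=False models partition-order consumption: everything behind the
    stuck message waits. per_key=True models per-key ordering: only later
    messages with the same key wait.
    """
    stuck_key = log[stuck_index][0]
    later = log[stuck_index + 1:]
    if per_key:
        return [m for m in later if m[0] == stuck_key]
    return list(later)


if __name__ == "__main__":
    partition = [("a", 1), ("b", 1), ("a", 2), ("c", 1)]
    print("partition order:", blocked_by(partition, 0, per_key=False))
    print("per-key order:  ", blocked_by(partition, 0, per_key=True))
```

With per-key ordering, only the later message for key "a" waits behind the stuck one, while "b" and "c" proceed. The consumer then has to track progress per key instead of as a single offset, which is the kind of extra bookkeeping the worst-case complexity concerns above refer to.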

Rebuilds, rewrites, and “Kafka from scratch”

  • LinkedIn’s C++ “Northguard” rewrite is mentioned as an example of rethinking Kafka, but its lack of protocol compatibility is seen as a major ecosystem break.
  • Several startups (Redpanda, Fluvio, AutoMQ, Warpstream) are effectively “new Kafka implementations” exploring S3-based storage, Rust/C++ rewrites, and different processing models.
  • Some participants are wary of ground-up rewrites on principle; they view the real constraint as Kafka’s entrenched ecosystem rather than pure technical design.