2025-02-18

Kafka at the low end: how bad can it get?

KIP-932 and Kafka-as-Queue Direction

Several commenters highlight KIP-932 (“Queues for Kafka”) as a major change: its share groups let multiple consumers read from the same partition and acknowledge records individually, which should make at-least-once workers and HTTP push gateways much easier to build and address many of the article's concerns.
Some see it as Kafka competing more directly with traditional job/message queues, but warn that the consumer model will become more complex, with more “foot-guns” for newcomers.
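To make the share-group idea concrete, here is a toy in-memory model of per-record delivery for a single partition. It is a sketch, not the Kafka client API: the class name, the lock timeout and the delivery limit are illustrative assumptions, and the states and accept/release/reject operations only loosely follow what KIP-932 describes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyShareGroup:
    """Toy model of share-group delivery for ONE partition (hypothetical, not the Kafka API)."""

    records: list
    lock_timeout: float = 30.0  # illustrative acquisition-lock duration, in seconds
    max_deliveries: int = 3  # illustrative delivery limit before a record is archived
    state: dict = field(default_factory=dict)  # index -> available/acquired/acknowledged/archived
    deliveries: dict = field(default_factory=dict)  # index -> delivery attempts so far
    lock_expiry: dict = field(default_factory=dict)  # index -> time the current lock lapses

    def acquire(self, now: float):
        """Hand the next deliverable record to whichever consumer asks, or return None."""
        for i, rec in enumerate(self.records):
            st = self.state.get(i, "available")
            if st == "acquired" and now >= self.lock_expiry[i]:
                st = self._release(i)  # lock lapsed (consumer died?): treat it like a release
            if st == "available":
                self.state[i] = "acquired"
                self.deliveries[i] = self.deliveries.get(i, 0) + 1
                self.lock_expiry[i] = now + self.lock_timeout
                return i, rec
        return None

    def accept(self, i: int):
        """Processed successfully: never deliver this record again."""
        self.state[i] = "acknowledged"

    def release(self, i: int):
        """Give the record back for redelivery, subject to the delivery limit."""
        self._release(i)

    def reject(self, i: int):
        """Unprocessable: archive it without further delivery."""
        self.state[i] = "archived"

    def _release(self, i: int) -> str:
        # A record that has used up its delivery attempts is archived instead of redelivered.
        if self.deliveries.get(i, 0) >= self.max_deliveries:
            self.state[i] = "archived"
        else:
            self.state[i] = "available"
        return self.state[i]
```

Unlike a classic consumer group, a second `acquire` call hands out record 1 while record 0 is still in flight, so an extra worker is not left idle behind a slow record; the price is that ordering within the partition is no longer guaranteed.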

Partitioning, Fairness, and Low-Volume Pathologies

The article's core issue: with few messages and multiple workers, round-robin or random partitioning can still leave some workers idle while others are overloaded, because within a consumer group each partition is read by only one consumer.
Some argue that better partitioning (key-based, random UUID keys, more partitions than workers, multi-threaded consumers) can largely mitigate this, especially at higher volumes.
Others (including the article’s author) maintain that with genuinely low volumes, you can’t rely on probabilistic smoothing; unfair distribution still occurs in practice.
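The low-volume argument can be checked with a little combinatorics. A sketch, assuming one partition per worker and that each message lands on a uniformly random partition (processing time is ignored entirely): with 4 workers, 4 messages leave at least one worker with nothing to do with probability 29/32 (about 91%), and 8 messages still do so about 38% of the time.

```python
from fractions import Fraction
from math import comb


def prob_all_workers_busy(messages: int, workers: int) -> Fraction:
    """Exact probability that every worker's partition receives at least one message
    when each message is assigned to a uniformly random partition.

    Counts onto (surjective) assignments by inclusion-exclusion:
    sum over j of (-1)^j * C(workers, j) * (workers - j)^messages.
    """
    onto = sum(
        (-1) ** j * comb(workers, j) * (workers - j) ** messages
        for j in range(workers + 1)
    )
    return Fraction(onto, workers ** messages)


for m in (4, 8, 16, 64):
    p_idle = 1 - prob_all_workers_busy(m, 4)
    print(f"{m:>3} messages, 4 workers: P(some worker idle) = {float(p_idle):.3f}")
```

The idle probability only becomes negligible once messages outnumber workers by an order of magnitude or more, which is the "probabilistic smoothing" the low-volume case cannot count on.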

Alternative Technologies Recommended

Strong preference in the thread for simpler tools at low throughput:
- Traditional DB with SELECT … FOR UPDATE SKIP LOCKED (especially Postgres).
- RabbitMQ, AWS SQS, Azure Service Bus, Google Cloud Tasks, NATS/JetStream, Redis (lists/streams/pubsub), Beanstalkd, ActiveMQ, Pulsar, Temporal.
Advice: use the database you already have until you truly hit scale; pay for managed brokers where possible to avoid operational pain.
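A minimal sketch of the SKIP LOCKED pattern for Postgres, with batching built in. The `jobs` table and its columns (`id`, `payload`, `status`, `started_at`) are hypothetical, and `conn` is assumed to be a DB-API 2.0 connection from a Postgres driver that uses the `%s` parameter style (psycopg, for example); nothing here is a specific library's queue API.

```python
# Hypothetical schema: jobs(id bigserial primary key, payload jsonb,
#                           status text, started_at timestamptz)
CLAIM_SQL = """
WITH next_jobs AS (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
    ORDER BY id
    LIMIT %s
    FOR UPDATE SKIP LOCKED  -- rows locked by other workers are skipped, not waited on
)
UPDATE jobs
SET status = 'running', started_at = now()
FROM next_jobs
WHERE jobs.id = next_jobs.id
RETURNING jobs.id, jobs.payload
"""


def claim_jobs(conn, batch_size: int = 10):
    """Atomically mark up to batch_size queued jobs as running and return (id, payload) rows.

    Concurrent workers running the same statement get disjoint batches.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (batch_size,))
        rows = cur.fetchall()
        conn.commit()  # the claim is durable once committed
        return rows
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Committing right away means a worker that crashes mid-job leaves rows stuck in `running`, so a sweeper that requeues stale `started_at` values is usually needed. The alternative is to keep the transaction (and the row locks) open while the job runs, so a crash releases them automatically, at the cost of long-lived transactions.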

Database-as-Queue Debate

Some warn “DB as queue” is an antipattern; others counter that modern databases explicitly support this (e.g., SKIP LOCKED) and that simplicity and transactional enqueueing outweigh inefficiency at small scale.
Commenters report real-world Postgres queues handling anywhere from tens of thousands to hundreds of millions of items per hour, using partitioning and batching to control SKIP LOCKED overhead.
Consensus: acceptable at low–medium volume; might require careful design at high throughput.
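Transactional enqueueing is the argument DB-queue defenders lean on most: the job row commits or rolls back together with the business data it refers to. The sketch below uses Python's built-in `sqlite3` with an in-memory database so it runs anywhere; SQLite has no SKIP LOCKED, so it shows only the enqueue side, and the tables and `place_order` function are made up for illustration.

```python
import json
import sqlite3

# Toy schema (hypothetical): an orders table plus a jobs table used as the queue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued'
);
""")


def place_order(conn, customer, fail_after_insert=False):
    """Insert an order and its follow-up job in ONE transaction."""
    with conn:  # sqlite3: commits on success, rolls back if an exception escapes
        cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", (customer,))
        order_id = cur.lastrowid
        conn.execute(
            "INSERT INTO jobs (kind, payload) VALUES (?, ?)",
            ("send_confirmation", json.dumps({"order_id": order_id})),
        )
        if fail_after_insert:
            raise RuntimeError("simulated crash before commit")
    return order_id


place_order(conn, "alice")
try:
    place_order(conn, "bob", fail_after_insert=True)
except RuntimeError:
    pass

orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
jobs = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(orders, jobs)  # 1 1: Bob's order and its job were rolled back together
```

With a separate broker, the same guarantee needs an outbox table or distributed-transaction machinery, because the database commit and the publish can fail independently.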

Durability, Semantics, and Operational Complexity

Kafka is repeatedly described as a distributed write-ahead log rather than a job queue; workers get at-least-once rather than exactly-once delivery in practice, and retry handling is tricky.
Handling failures, retries, and “competing consumers” in Kafka often needs extra topics, DB tables, or custom logic.
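One common shape for that extra logic: a classic consumer commits a position per partition rather than acknowledging individual records, so a failed record is republished to a retry topic with a growing delay and finally to a dead-letter topic. The routing decision is sketched below; the `<topic>.retry.<n>` / `<topic>.dlq` naming, the delays, and the idea of carrying the attempt count in a record header are all assumptions, and nothing here talks to a broker.

```python
# Hypothetical naming scheme and back-off; tune or replace for a real system.
RETRY_DELAYS_S = (10, 60, 600)  # delay before each retry tier, in seconds


def next_destination(original_topic: str, attempts_so_far: int):
    """Return (topic, delay_seconds) for a record that has just failed.

    attempts_so_far counts failed processing attempts, including this one
    (e.g. read from a record header the producer increments on each retry).
    """
    if attempts_so_far < 1:
        raise ValueError("call this only after at least one failure")
    tier = attempts_so_far  # 1st failure -> retry.1, 2nd -> retry.2, ...
    if tier <= len(RETRY_DELAYS_S):
        return f"{original_topic}.retry.{tier}", RETRY_DELAYS_S[tier - 1]
    return f"{original_topic}.dlq", None  # out of retries: park for manual inspection
```

Each retry topic then needs its own consumer that waits out the delay before reprocessing, which is exactly the kind of extra machinery the thread complains about.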
Experience reports: Kafka is operationally heavy (JVM, cluster management, unclear durability defaults); RabbitMQ, NATS, Pulsar, or Redpanda are often perceived as simpler.
A subthread debates Kafka vs Redpanda durability (fsync, replication factors), with disagreement over how risky Kafka’s default settings are.
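The replication half of that debate comes down to a few settings that are easy to check mechanically. A sketch, assuming the commonly cited baseline of `acks=all`, a replication factor of at least 3 and `min.insync.replicas` of at least 2 (these thresholds are this sketch's choice, not Kafka's defaults; the broker defaults for `default.replication.factor` and `min.insync.replicas` are both 1). The fsync half depends on broker flush settings, which by default leave flushing to the OS page cache, and is not covered here.

```python
def durability_warnings(acks, replication_factor: int, min_insync_replicas: int) -> list:
    """Flag settings that weaken Kafka write durability (illustrative thresholds)."""
    warnings = []
    if str(acks) not in ("all", "-1"):
        # With acks=0 or acks=1 the leader can acknowledge before followers copy the write.
        warnings.append(f"acks={acks}: a leader failure can lose acknowledged writes")
    if replication_factor < 3:
        warnings.append(f"replication factor {replication_factor} < 3")
    if min_insync_replicas < 2:
        # acks=all waits only for the current ISR, which may shrink to the leader alone.
        warnings.append(
            f"min.insync.replicas={min_insync_replicas}: acks=all may be satisfied by one replica"
        )
    if min_insync_replicas >= replication_factor:
        warnings.append(
            "min.insync.replicas >= replication factor: one lagging replica blocks acks=all writes"
        )
    return warnings
```

For example, `durability_warnings("all", 3, 2)` returns an empty list, while a single-broker development setup with `acks=1` trips every check.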

Ecosystem, Popularity, and Overuse of Kafka

Many see Kafka adoption in low-volume scenarios as “resume-driven development” or a red-flag dependency chosen for buzz rather than fit.
Counterpoint: even at low volume, Kafka can be justified for multi-consumer replayability, ordered per-key processing, and chaining async workflows.
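Those three properties follow from the log model: records with the same key land in the same partition and keep their order, and each consumer group tracks its own offset, so one group can replay history without affecting another. A toy sketch of that model (not Kafka's implementation; CRC32 stands in for the murmur2 hash Kafka's default partitioner uses, and the group names are made up):

```python
import zlib
from collections import defaultdict


class ToyLog:
    """Append-only partitioned log with independent consumer-group offsets (a toy, not Kafka)."""

    def __init__(self, partitions: int):
        self.partitions = [[] for _ in range(partitions)]
        self.offsets = defaultdict(int)  # (group, partition) -> next offset to read

    def partition_for(self, key: str) -> int:
        # Same key -> same partition -> per-key order is preserved.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def append(self, key: str, value):
        self.partitions[self.partition_for(key)].append((key, value))

    def poll(self, group: str, partition: int):
        """Return this group's unread records and advance only this group's offset."""
        start = self.offsets[(group, partition)]
        records = self.partitions[partition][start:]
        self.offsets[(group, partition)] = len(self.partitions[partition])
        return records

    def seek(self, group: str, partition: int, offset: int):
        """Replay: move one group's offset back; other groups are unaffected."""
        self.offsets[(group, partition)] = offset


log = ToyLog(partitions=3)
for step in ("created", "paid", "shipped"):
    log.append("order-42", step)
p = log.partition_for("order-42")
print(log.poll("billing", p))    # all three steps, in order
print(log.poll("analytics", p))  # a second group independently sees the same three
```

A traditional queue deletes a message once one worker acknowledges it, so getting the same replay and fan-out there means duplicating messages per subscriber.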
Pulsar and Redpanda are discussed as technically strong Kafka alternatives, but Kafka’s ecosystem and commercial backing give it momentum.

Miscellaneous

Some teams report abandoning Kafka for RabbitMQ after hitting the fairness and complexity issues described in the article.
Side discussions cover SCADA integrations, .NET messaging libraries, and the origin/irony of the “Kafka” name.
