2024-09-28

How Discord stores trillions of messages (2023)

Scale, Work, and Overengineering

Many are relieved not to work at Discord’s scale, citing complexity, bureaucracy, and “promotion‑driven development,” though some enjoy the rare 5% of truly hard, creative problems.
Several note that 177–200 nodes (now ~72) for trillions of messages feels modest, suggesting a lot of modern cloud architectures are overengineered for much smaller workloads.
Frustration that such hyperscale problems are used as interview benchmarks for roles that will never operate near this scale.

Cassandra → ScyllaDB Migration & GC Debate

Cassandra was described as operationally painful at large scale, especially around tombstones from heavy deletes, hot partitions, and repairs.
Some argue Discord under‑tuned Cassandra and ran old JVMs/GC (CMS, old Cassandra) and over‑blamed GC instead of upgrading to newer collectors like ZGC.
Others counter that Scylla’s design (C++ implementation, CPU/IO schedulers, compaction strategies like ICS, tombstone_gc modes, reverse queries) plus better defaults makes it harder to “use wrong.”
Discord engineers in the thread say they kept the schema/partitioning, added read coalescing, and Scylla handled large partitions and tombstones much better.

Database Choices & Performance

Debate over whether Postgres (with sharding tools like Vitess/Citus/AlloyDB) could handle this workload:
- Pro‑Postgres side: scales to billions/trillions of rows if you design carefully; strong SQL features.
- Skeptical side: horizontal write scaling and multi‑master replication are much harder; sharding Postgres at this scale would be elite, bespoke work.
LSM stores (Cassandra/Scylla) are seen as better for extreme write throughput; some question choosing a DB where reads are more expensive than writes for a chat system, others note “more expensive than writes” doesn’t mean reads are slow.
Per‑Discord‑server databases are dismissed as infeasible with hundreds of millions of servers.

Caching, Coalescing & Tail Latency

The message service layer is compared to Varnish/Nginx techniques (request coalescing, “grace”/stale‑while‑revalidate, use_stale).
Discord emphasizes read coalescing rather than caching: collapse identical concurrent queries to avoid thundering herds without cache invalidation complexity.

Data Value, Archiving, and Privacy

Some question why store “trillions of shitposts,” while others note there’s real, niche knowledge mixed in and no easy way to separate “good” from “bad.”
Concerns that Discord’s walled‑garden model prevents long‑term archiving and discoverability; suggestions to use bots/projects to mirror public support chats.
Strong criticism of how hard it is to truly delete old messages, with worries about GDPR compliance and lack of effective user data erasure.

Centralization vs Open Protocols

Repeated lament that centralized Discord displaced IRC/Matrix/XMPP, framed as a tragedy for openness and ephemerality.
Counterpoint: Discord’s success is attributed to superior UX—zero‑install web client, easy onboarding, reliable voice, history, and “just works” simplicity—where open protocols suffer from clunky clients, confusing onboarding, and federation complexity.
Long subthread on E2EE, message history, metadata, and nostalgia for the more anonymous, ephemeral “old internet,” vs the convenience expectations of modern users.

Related topics