How Discord stores trillions of messages (2023)
Scale, Work, and Overengineering
- Many are relieved not to work at Discord’s scale, citing complexity, bureaucracy, and “promotion‑driven development,” though some enjoy the rare 5% of truly hard, creative problems.
- Several note that 177–200 nodes (now ~72) for trillions of messages feels modest, suggesting a lot of modern cloud architectures are overengineered for much smaller workloads.
- Frustration that such hyperscale problems are used as interview benchmarks for roles that will never operate near this scale.
Cassandra → ScyllaDB Migration & GC Debate
- Cassandra was described as operationally painful at large scale, especially around tombstones from heavy deletes, hot partitions, and repairs.
- Some argue Discord under‑tuned Cassandra and ran old JVMs/GC (CMS, old Cassandra) and over‑blamed GC instead of upgrading to newer collectors like ZGC.
- Others counter that Scylla’s design (C++ implementation, CPU/IO schedulers, compaction strategies like ICS, tombstone_gc modes, reverse queries) plus better defaults makes it harder to “use wrong.”
- Discord engineers in the thread say they kept the schema/partitioning, added read coalescing, and Scylla handled large partitions and tombstones much better.
Database Choices & Performance
- Debate over whether Postgres (with sharding tools like Vitess/Citus/AlloyDB) could handle this workload:
- Pro‑Postgres side: scales to billions/trillions of rows if you design carefully; strong SQL features.
- Skeptical side: horizontal write scaling and multi‑master replication are much harder; sharding Postgres at this scale would be elite, bespoke work.
- LSM stores (Cassandra/Scylla) are seen as better for extreme write throughput; some question choosing a DB where reads are more expensive than writes for a chat system, others note “more expensive than writes” doesn’t mean reads are slow.
- Per‑Discord‑server databases are dismissed as infeasible with hundreds of millions of servers.
Caching, Coalescing & Tail Latency
- The message service layer is compared to Varnish/Nginx techniques (request coalescing, “grace”/stale‑while‑revalidate, use_stale).
- Discord emphasizes read coalescing rather than caching: collapse identical concurrent queries to avoid thundering herds without cache invalidation complexity.
Data Value, Archiving, and Privacy
- Some question why store “trillions of shitposts,” while others note there’s real, niche knowledge mixed in and no easy way to separate “good” from “bad.”
- Concerns that Discord’s walled‑garden model prevents long‑term archiving and discoverability; suggestions to use bots/projects to mirror public support chats.
- Strong criticism of how hard it is to truly delete old messages, with worries about GDPR compliance and lack of effective user data erasure.
Centralization vs Open Protocols
- Repeated lament that centralized Discord displaced IRC/Matrix/XMPP, framed as a tragedy for openness and ephemerality.
- Counterpoint: Discord’s success is attributed to superior UX—zero‑install web client, easy onboarding, reliable voice, history, and “just works” simplicity—where open protocols suffer from clunky clients, confusing onboarding, and federation complexity.
- Long subthread on E2EE, message history, metadata, and nostalgia for the more anonymous, ephemeral “old internet,” vs the convenience expectations of modern users.