How Discord stores trillions of messages (2023)

Scale, Work, and Overengineering

  • Many are relieved not to work at Discord’s scale, citing complexity, bureaucracy, and “promotion‑driven development,” though some enjoy the rare 5% of truly hard, creative problems.
  • Several note that 177–200 nodes (now ~72) for trillions of messages feels modest, suggesting a lot of modern cloud architectures are overengineered for much smaller workloads.
  • Frustration that such hyperscale problems are used as interview benchmarks for roles that will never operate near this scale.

Cassandra → ScyllaDB Migration & GC Debate

  • Cassandra was described as operationally painful at large scale, especially around tombstones from heavy deletes, hot partitions, and repairs.
  • Some argue Discord under‑tuned Cassandra and ran old JVMs/GC (CMS, old Cassandra) and over‑blamed GC instead of upgrading to newer collectors like ZGC.
  • Others counter that Scylla’s design (C++ implementation, CPU/IO schedulers, compaction strategies like ICS, tombstone_gc modes, reverse queries) plus better defaults makes it harder to “use wrong.”
  • Discord engineers in the thread say they kept the schema/partitioning, added read coalescing, and Scylla handled large partitions and tombstones much better.

Database Choices & Performance

  • Debate over whether Postgres (with sharding tools like Vitess/Citus/AlloyDB) could handle this workload:
    • Pro‑Postgres side: scales to billions/trillions of rows if you design carefully; strong SQL features.
    • Skeptical side: horizontal write scaling and multi‑master replication are much harder; sharding Postgres at this scale would be elite, bespoke work.
  • LSM stores (Cassandra/Scylla) are seen as better for extreme write throughput; some question choosing a DB where reads are more expensive than writes for a chat system, others note “more expensive than writes” doesn’t mean reads are slow.
  • Per‑Discord‑server databases are dismissed as infeasible with hundreds of millions of servers.

Caching, Coalescing & Tail Latency

  • The message service layer is compared to Varnish/Nginx techniques (request coalescing, “grace”/stale‑while‑revalidate, use_stale).
  • Discord emphasizes read coalescing rather than caching: collapse identical concurrent queries to avoid thundering herds without cache invalidation complexity.

Data Value, Archiving, and Privacy

  • Some question why store “trillions of shitposts,” while others note there’s real, niche knowledge mixed in and no easy way to separate “good” from “bad.”
  • Concerns that Discord’s walled‑garden model prevents long‑term archiving and discoverability; suggestions to use bots/projects to mirror public support chats.
  • Strong criticism of how hard it is to truly delete old messages, with worries about GDPR compliance and lack of effective user data erasure.

Centralization vs Open Protocols

  • Repeated lament that centralized Discord displaced IRC/Matrix/XMPP, framed as a tragedy for openness and ephemerality.
  • Counterpoint: Discord’s success is attributed to superior UX—zero‑install web client, easy onboarding, reliable voice, history, and “just works” simplicity—where open protocols suffer from clunky clients, confusing onboarding, and federation complexity.
  • Long subthread on E2EE, message history, metadata, and nostalgia for the more anonymous, ephemeral “old internet,” vs the convenience expectations of modern users.