Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024)

Ethics of Scraping and Publication

  • Many commenters see the project as “shameful”: scraping billions of casual chats (often by minors) without their knowledge, then publishing them, is viewed as violating norms of research ethics and basic politeness, even if technically allowed.
  • Others argue it’s ethically necessary disclosure: if this is possible, intelligence agencies, criminals, and data brokers have likely done it already. Making it visible in an academic, open way is framed as “public red teaming” that forces people to confront real risks.

Public vs Private: What Does “Public Discord” Mean?

  • The dataset is limited to servers listed in Discord’s Discovery tab, which anyone can join without an invite. Supporters say this makes them essentially public, comparable to forums, Usenet, or StackOverflow.
  • Critics counter that “invite-based servers” and the “server” metaphor create an illusion of semi-privacy and ephemerality; users expect a flowing chatroom, not a permanent, globally queryable corpus.
  • The core tension is whether “anyone can join and scroll back” amounts to a “reasonable expectation it may be archived and redistributed.”

Anonymization and Re‑identification Risks

  • The paper describes pseudonyms and truncated SHA‑256 hashes for IDs; many find this “pretty thorough” on paper.
  • Others highlight weaknesses: because the hashing is unsalted, an attacker can hash known usernames and look for matches in the dataset; once a specific channel is matched, those users can be tracked across the whole dataset; and references to real names or nicknames inside message text remain.
  • One commenter publishes a deeper critique claiming the ID anonymization scheme is flawed and re-identification is realistically possible.
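
The unsalted-hash weakness raised above can be illustrated with a minimal Python sketch. It is not the paper's actual pipeline: the truncation length, the identifiers, and the helper names (`pseudonymize`, `keyed_pseudonymize`, `reidentify`) are all assumptions made for illustration. The point is only that anyone holding a list of candidate identifiers can recompute the same unsalted hashes and match them against published pseudonyms.

```python
import hashlib
import hmac

# Assumed truncation length in hex characters; the paper's real value may differ.
TRUNCATE_HEX = 16


def pseudonymize(identifier: str) -> str:
    """Unsalted, truncated SHA-256 of an identifier (hypothetical scheme)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()[:TRUNCATE_HEX]


def keyed_pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Keyed variant (HMAC-SHA-256); cannot be recomputed without the key."""
    mac = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:TRUNCATE_HEX]


def reidentify(candidates, published):
    """Map published pseudonyms back to candidate identifiers by rehashing."""
    found = {}
    for candidate in candidates:
        digest = pseudonymize(candidate)
        if digest in published:
            found[digest] = candidate
    return found


# Toy data: these identifiers are made up and do not come from the dataset.
published = {pseudonymize("alice_example"), pseudonymize("bob_example")}
matches = reidentify(["alice_example", "carol_example"], published)
print(matches)
```

With a keyed hash, the same lookup fails unless the attacker also holds the secret key, which is why commenters single out the lack of a salt. A key does nothing about names that appear inside message text.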

Legal / ToS / GDPR Questions

  • Multiple comments note this likely violates Discord’s ToS and developer terms (no bulk export / sharing of API data). Debate centers on whether breaking ToS can still be “ethical.”
  • GDPR concerns: even if messages were public, true anonymization is disputed, and there is no user-level mechanism to request deletion. Others argue GDPR is misaligned with the practical permanence of public posts.

Impact on Users, Especially Youth

  • Strong worry about minors and young people: Discord has been a primary social space where teens “grow up,” make mistakes, and expect some contextual obscurity.
  • Some see this as fueling long‑term “cancel” dynamics; others say the real solution is cultural (right to forgiveness) and better education that nothing online is truly private.

Discord as Knowledge Sink and Forum Replacement

  • A separate but related thread: Discord’s rise as a replacement for forums (modding, hobby communities, docs, support) is widely criticized for poor search, walled access, and fragile archives.
  • Some welcome the dataset as a way to surface technical knowledge otherwise trapped in Discord; others say that doesn’t justify mass scraping of social spaces.

Dataset Details and Distribution

  • The dataset is 118 GB of Zstandard-compressed JSON (2.1 TB uncompressed). It was initially freely downloadable from Zenodo and was later restricted; the community quickly shared hashes and magnet links to redistribute it.
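
A redistributed copy is normally checked against a shared hash by recomputing the file's digest locally. The sketch below assumes the shared value is a SHA-256 hex digest, which the discussion does not confirm, and the function names are hypothetical. It reads the file in chunks so that an archive of this size never has to fit in memory.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a local file's digest with a hash shared by someone else."""
    return file_sha256(path) == expected_hex.strip().lower()
```

A match shows only that the local file is byte-identical to whatever the hash's publisher had. It says nothing about whether that copy is the original Zenodo release.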