Large-Scale Online Deanonymization with LLMs

Perceived Novelty and Methodology

  • Some readers dismiss the work as “obvious” (e.g., if accounts link to LinkedIn, they’re not anonymous), but the authors clarify:
    • HN data was heavily redacted to remove explicit identifiers (appendix Table 2).
    • A more realistic test used Anthropic’s redacted interviewer dataset, where their agent re‑identified 9 of 125 people from contextual clues alone.
  • Multiple commenters stress that, while the underlying OSINT ideas aren’t new, LLMs make large‑scale, cross‑platform deanonymization cheap and automatable.

Stylometry vs Semantic Clues

  • Many assume stylometry (writing style) is the main attack, proposing defenses like local LLMs to rewrite text.
  • The authors repeatedly state the paper “essentially doesn’t use stylometry”; it relies on semantic clues: interests, locations, workplaces, conferences, pets, etc.
  • Stylometry is only lightly used in one experiment (matching split Reddit accounts and a movie‑review transformation).
  • Historical stylometry work on HN and other platforms is cited as already very effective at linking alt accounts.
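To make the stylometry side concrete, here is a minimal toy sketch of the classic approach referenced above: represent each account's text as character n‑gram counts and link accounts by cosine similarity. This is an illustration of the general technique, not the method used in the paper (which, per the authors, largely avoids stylometry); the data and function names are invented for the example.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-gram counts, a classic stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(unknown_text, candidates):
    """Return the candidate author whose style profile best matches unknown_text."""
    profile = char_ngrams(unknown_text)
    return max(candidates, key=lambda name: cosine(profile, char_ngrams(candidates[name])))
```

Even this crude feature set links alt accounts surprisingly well on longer texts, which is why the historical HN stylometry work was already effective.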

How Little Information Is Enough

  • Commenters note how a handful of posts can leak:
    • City (sports teams, landmarks), job domain, age (cultural references), work schedule (post times).
  • The Netflix deanonymization paper is referenced as early evidence that sparse, “anonymous” datasets can be re‑identified; commenters argue things have only gotten easier.
  • One key point: even pseudonymous users who never directly reveal their name often leak, over years of posting, enough crumbs to pinpoint them.
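The Netflix‑style intuition above can be shown with a toy example (entirely hypothetical data): individually weak clues such as city, job domain, and rough age become uniquely identifying once intersected against a candidate population.

```python
# Hypothetical population: no single attribute is identifying on its own.
people = [
    {"name": "A", "city": "Austin", "field": "embedded dev", "age_band": "30s"},
    {"name": "B", "city": "Austin", "field": "embedded dev", "age_band": "40s"},
    {"name": "C", "city": "Austin", "field": "web dev",      "age_band": "30s"},
    {"name": "D", "city": "Denver", "field": "embedded dev", "age_band": "30s"},
]

def candidates(pop, **clues):
    """Keep only the people consistent with every leaked clue."""
    return [p for p in pop if all(p[k] == v for k, v in clues.items())]

# Each clue alone leaves ambiguity; jointly they single out one person.
print(len(candidates(people, city="Austin")))                        # 3
print(len(candidates(people, city="Austin", field="embedded dev")))  # 2
print(len(candidates(people, city="Austin", field="embedded dev",
                     age_band="30s")))                               # 1
```

Real populations are far larger, but the clue-per-post budget needed to reach a unique match shrinks quickly as more attributes leak.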

Risk Assessment and Adversaries

  • One view: governments and corporations already have stronger tools, so impact is marginal.
  • Counterview: lowering cost broadens the set of adversaries (scammers, harassers, activists targeting opponents, insurers, repressive states monitoring diaspora).
  • Concerns include:
    • Chained attacks (social engineering to collect just enough data for later deanonymization).
    • Scalable doxing, job-targeted harassment, and retroactive punishment for old posts.

Mitigations and Countermeasures

  • Proposed defenses:
    • Local LLM “slopifiers” to rewrite style; others note this doesn’t remove semantic clues and may hurt credibility.
    • Injecting noise: fake locations, jobs, hobbies; bots that post misleading content; multiple short‑lived accounts.
    • “Flood the zone” strategies to create so much conflicting data that profiling becomes noisy.
  • Skeptics argue:
    • Noise can often be filtered; behavior patterns and interests still leak.
    • Heavy use of bots/false personas risks making social media unusable and indistinguishable from spam.

Platform Design, Policy, and Behavior Changes

  • Some call for:
    • Stricter controls on social‑platform APIs and mass scraping.
    • Better user tools (warnings when posts reveal sensitive metadata; local LLM privacy helpers).
    • Features like deletion or making posts private on sites like HN.
  • Several predict:
    • More people will reduce public posting, rotate accounts, or rely on local LLMs.
    • A shift toward local inference (to avoid API logs becoming an even richer deanonymization source).
  • There is tension between:
    • Using real names to stay “clean” and accountable.
    • Assuming future surveillance, retroactive norm changes, and potential state or corporate abuse mean “the only winning move is not to play.”
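One of the proposed user tools above (warnings when posts reveal sensitive metadata) could look something like this minimal sketch; the patterns and function names are invented for illustration, and a real tool would need far broader coverage than a few regexes.

```python
import re

# Hypothetical pre-post "privacy linter": flag phrases that commonly
# leak quasi-identifiers before the user hits submit.
PATTERNS = {
    "employer": re.compile(r"\b(?:i work at|my employer)\s+\S+", re.I),
    "location": re.compile(r"\b(?:i live in|i'm based in)\s+\S+", re.I),
    "age":      re.compile(r"\b(?:i am|i'm)\s+\d{2}\s+years old\b", re.I),
}

def privacy_warnings(post):
    """Return the categories of quasi-identifiers a draft post leaks."""
    return [name for name, pat in PATTERNS.items() if pat.search(post)]

print(privacy_warnings("I work at Initech and I live in Springfield."))
# → ['employer', 'location']
```

A local LLM could play the same role with much better recall, which is the “local LLM privacy helper” idea several commenters raised.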