Large-Scale Online Deanonymization with LLMs
Perceived Novelty and Methodology
- Some readers dismiss the work as “obvious” (e.g., if accounts link to LinkedIn, they’re not anonymous), but the authors clarify:
  - HN data was heavily redacted to remove explicit identifiers (appendix Table 2).
  - A more realistic test used Anthropic’s redacted interviewer dataset, where their agent re-identified 9 of 125 people based only on contextual clues.
- Multiple commenters stress that, while the underlying OSINT ideas aren’t new, LLMs make large‑scale, cross‑platform deanonymization cheap and automatable.
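The cross-platform linking the commenters describe can be sketched as attribute-overlap scoring: once semantic clues have been extracted from each account (by an LLM or by hand), linking reduces to comparing attribute sets. All account names and attributes below are invented for illustration.

```python
# Hypothetical sketch: link accounts across platforms by the overlap of
# extracted semantic attributes. All names and data here are invented.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two attribute sets (0.0 when both empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Attributes an LLM might have pulled from each account's post history.
hn_account = {"city:seattle", "job:embedded-dev", "hobby:homebrewing", "conf:cppcon"}
candidates = {
    "reddit_user_1": {"city:seattle", "job:embedded-dev", "hobby:homebrewing"},
    "reddit_user_2": {"city:austin", "job:web-dev", "hobby:cycling"},
}

# Pick the candidate whose attribute set overlaps most with the HN account.
best = max(candidates, key=lambda u: jaccard(hn_account, candidates[u]))
```

The point of the sketch is the economics, not the scoring function: the expensive step used to be extracting and matching those attributes at scale, and that is the part LLMs automate.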
Stylometry vs Semantic Clues
- Many assume stylometry (writing style) is the main attack, proposing defenses like local LLMs to rewrite text.
- The authors repeatedly state the paper “essentially doesn’t use stylometry”; it relies on semantic clues: interests, locations, workplaces, conferences, pets, etc.
- Stylometry is only lightly used in one experiment (matching split Reddit accounts and a movie‑review transformation).
- Historical stylometry work on HN and other platforms is cited as already very effective at linking alt accounts.
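For contrast with the semantic-clue approach, a minimal stylometry baseline of the kind the cited historical work builds on can be sketched as character-trigram frequency comparison. This is an assumed, generic formulation, not the paper's method.

```python
# Minimal stylometry sketch (a generic baseline, not the paper's method):
# compare two text samples by cosine similarity of character-trigram counts.
from collections import Counter
import math

def trigrams(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased text."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Note what this baseline keys on: punctuation habits, spelling, and word shape. Rewriting text (the “local LLM” defense) disrupts exactly these features, which is why it targets stylometry but leaves semantic clues untouched.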
How Little Information Is Enough
- Commenters note how a handful of posts can leak:
  - City (sports teams, landmarks), job domain, age (cultural references), and work schedule (post times).
- The Netflix de-anonymization paper is referenced as early evidence that sparse, “anonymous” datasets can be re‑identified; commenters argue things have only gotten easier.
- One key point: even pseudonymous users who never directly reveal their names still, over years of posting, leak enough crumbs to pinpoint them.
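A back-of-envelope calculation shows why a handful of clues suffices: each roughly independent attribute multiplies down the candidate pool. All numbers below are invented for illustration.

```python
# Back-of-envelope sketch of how sparse clues compound (numbers invented):
# each roughly independent attribute shrinks the candidate pool multiplicatively.
population = 300_000_000            # hypothetical starting pool

clues = {
    "lives in a specific metro area": 1 / 50,
    "works in a specific job domain": 1 / 200,
    "attends a specific conference":  1 / 1000,
}

candidates = population
for clue, fraction in clues.items():
    candidates *= fraction

# candidates is now 300e6 / (50 * 200 * 1000) = 30 — a shortlist small
# enough to check by hand, from just three casual disclosures.
```

This is the same arithmetic behind the Netflix result: attributes that are individually harmless become identifying in combination.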
Risk Assessment and Adversaries
- One view: governments and corporations already have stronger tools, so impact is marginal.
- Counterview: lowering cost broadens the set of adversaries (scammers, harassers, activists targeting opponents, insurers, repressive states monitoring diaspora).
- Concerns include:
  - Chained attacks (social engineering to collect just enough data for later deanonymization).
  - Scalable doxing, job-targeted harassment, and retroactive punishment for old posts.
Mitigations and Countermeasures
- Proposed defenses:
  - Local LLM “slopifiers” to rewrite style; others note this doesn’t remove semantic clues and may hurt credibility.
  - Injecting noise: fake locations, jobs, hobbies; bots that post misleading content; multiple short‑lived accounts.
  - “Flood the zone” strategies to create so much conflicting data that profiling becomes noisy.
- Skeptics argue:
  - Noise can often be filtered; behavior patterns and interests still leak.
  - Heavy use of bots/false personas risks making social media unusable and indistinguishable from spam.
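The skeptics' filtering argument can be sketched concretely: injected noise tends to be a one-off, while genuine traits recur across a post history, so even a crude frequency threshold recovers the real profile. The data and threshold below are invented.

```python
# Sketch of the skeptics' point (data and threshold invented): fake
# attributes are mentioned once, real ones recur, so a simple frequency
# filter separates signal from injected noise.
from collections import Counter

attribute_mentions = [
    "city:seattle", "city:seattle", "city:seattle", "city:seattle",
    "city:lagos",                      # injected fake location, mentioned once
    "job:embedded-dev", "job:embedded-dev", "job:embedded-dev",
    "job:astronaut",                   # injected fake job, mentioned once
]

counts = Counter(attribute_mentions)

# Keep only attributes asserted repeatedly across the post history.
threshold = 2
profile = {attr for attr, n in counts.items() if n >= threshold}
```

Beating this filter requires sustaining each fake attribute as consistently as a real one, which is exactly the effort cost that makes noise injection impractical for most users.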
Platform Design, Policy, and Behavior Changes
- Some call for:
  - Stricter controls on social‑platform APIs and mass scraping.
  - Better user tools (warnings when posts reveal sensitive metadata; local LLM privacy helpers).
  - Features like post deletion or making posts private on sites like HN.
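The “warn before you post” idea can be sketched as a local check that flags phrases which tend to leak identifying detail. The rules and categories below are invented; a real helper would likely use an LLM rather than regexes.

```python
# Hypothetical sketch of a local "warn before you post" helper: flag
# draft text that matches patterns known to leak identity. The rules
# and category names here are invented for illustration.
import re

LEAK_PATTERNS = {
    "employer": re.compile(r"\b(?:I work at|my employer)\b", re.IGNORECASE),
    "location": re.compile(r"\b(?:I live in|my city)\b", re.IGNORECASE),
    "age":      re.compile(r"\bI'?m \d{2} years old\b", re.IGNORECASE),
}

def leak_warnings(draft: str) -> list[str]:
    """Return the categories of identifying detail found in a draft post."""
    return [name for name, pat in LEAK_PATTERNS.items() if pat.search(draft)]
```

Running entirely locally matters here: a cloud-hosted checker would itself become one more log of exactly the sensitive details it is meant to protect.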
- Several predict:
  - More people will reduce public posting, rotate accounts, or rely on local LLMs.
  - A shift toward local inference (to avoid API logs becoming an even richer deanonymization source).
- There is tension between:
  - Using real names to stay “clean” and accountable.
  - Assuming future surveillance, retroactive norm changes, and potential state or corporate abuse mean “the only winning move is not to play.”