Reproducing Hacker News writing style fingerprinting

Perceived accuracy and limitations

  • Experiences are mixed: some users report the tool correctly surfacing multiple old or alternate accounts (sometimes ones they had forgotten), while others find no alts in their top 20–100 matches, or see mostly “random” matches.
  • Effectiveness seems strongly tied to volume of text per account; rarely used throwaways or very old accounts with few comments generally don’t match well.
  • Many note that matches often feel more like “same topic” than “same style”, especially among accounts that frequently discuss LLMs, Musk, self‑driving, and similar recurring subjects.
  • Similarity scores vary widely across users (some have many matches above 0.85, others top out around 0.75), raising questions about what “uniqueness” of style actually means here.

Methodology and technical discussion

  • The system deliberately focuses on very common “function” words (top ~500) as stylometric signals, following Burrows-style stylometry, rather than on content words.
  • The author emphasizes that vector sets are a general data structure, not just a home for learned embeddings; the pipeline uses cosine similarity on word-frequency vectors, with optional quantization and random projection (a rough sketch follows this list).
  • Non-mutual “nearest neighbor” relationships are explained by vector geometry and ranking (A can be B’s closest neighbor while B has several accounts closer than A); the tiny asymmetries in reported scores come from int8 quantization, not from cosine similarity, which is symmetric.
  • Some commenters argue that BERT-like embeddings, autoencoders, dimensionality reduction, bigram/n‑gram features, or sentence-initial words could improve authorship attribution, though these also risk drifting toward topic modeling.
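
As a rough illustration of this pipeline, here is a minimal Python sketch: a hand-picked function-word list stands in for the real top-~500 vocabulary, relative frequencies form the style vector, cosine similarity compares two authors, and an int8 quantization step shows where the tiny score asymmetries come from. Names and the word list are placeholders, not the original implementation.

```python
import re
import numpy as np

# Hypothetical stand-in vocabulary; the real system derives the top ~500
# most frequent words from the whole comment corpus.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it",
                  "is", "was", "i", "for", "on", "you", "but", "not"]
WORD_INDEX = {w: i for i, w in enumerate(FUNCTION_WORDS)}

def style_vector(text: str) -> np.ndarray:
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = np.zeros(len(FUNCTION_WORDS), dtype=np.float32)
    for tok in tokens:
        idx = WORD_INDEX.get(tok)
        if idx is not None:
            counts[idx] += 1.0
    total = counts.sum()
    return counts / total if total else counts

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0

def quantize_int8(v: np.ndarray) -> np.ndarray:
    """Round to int8; this rounding, not the cosine itself, is what makes
    stored scores very slightly asymmetric."""
    scale = float(np.abs(v).max()) or 1.0
    return np.round(v / scale * 127).astype(np.int8)

a = style_vector("I think that the argument is that it was not obvious to you.")
b = style_vector("But you said it is the same thing in the end, not a different one.")
print(f"float cosine:     {cosine(a, b):.4f}")
print(f"quantized cosine: {cosine(quantize_int8(a).astype(np.float32), quantize_int8(b).astype(np.float32)):.4f}")
```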

Visualization, clustering, and alternatives

  • Several commenters suggest clustered or 2D visualizations (t‑SNE, MDS) and simple k‑means clustering on the style vectors; see the sketch after this list.
  • There’s skepticism about projecting 350D down to 2D as a faithful representation, but agreement that it would be fun and illustrative.
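
A sketch of the suggested visualization and clustering, assuming per-account style vectors are stacked into a matrix X (rows = accounts, columns = word frequencies); scikit-learn’s KMeans and TSNE stand in for whatever tooling the commenters had in mind, and the random matrix below is only a placeholder for real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((200, 350)).astype(np.float32)  # placeholder: 200 accounts x 350 word frequencies

# Simple k-means on the style vectors, as suggested in the thread.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# t-SNE projection to 2D for plotting; as noted above, 350D -> 2D distorts
# distances, so the picture is illustrative rather than faithful.
xy = TSNE(n_components=2, metric="cosine", init="random", random_state=0).fit_transform(X)

# e.g. plt.scatter(xy[:, 0], xy[:, 1], c=labels) with matplotlib
print(xy.shape, np.bincount(labels))
```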

Language, dialect, and behavioral patterns

  • Users notice clustering along non‑native English backgrounds, shared first languages, or regional spelling (UK/AU vs US), as well as shared autocorrect/dictionary behavior.
  • Some observe that conscious style choices (avoiding “should”, “this”, or first-person pronouns) are clearly reflected in the “analyze” feature.
  • There’s interest in using such signals to guess a writer’s native language or region; a toy spelling-based sketch follows this list.
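
A toy illustration of the region-guessing idea, counting UK/AU versus US spelling variants in a piece of text; the variant list is a small hand-picked sample for the example, not anything taken from the original tool.

```python
import re

# Hand-picked spelling pairs (UK/AU form, US form); purely illustrative.
VARIANTS = [("colour", "color"), ("favourite", "favorite"),
            ("organise", "organize"), ("centre", "center"),
            ("behaviour", "behavior"), ("licence", "license")]

def spelling_lean(text: str) -> str:
    """Classify text as UK/AU-leaning or US-leaning by variant counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    uk = sum(tokens.count(u) for u, _ in VARIANTS)
    us = sum(tokens.count(a) for _, a in VARIANTS)
    if uk == us:
        return "unclear"
    return "UK/AU-leaning" if uk > us else "US-leaning"

print(spelling_lean("My favourite colour scheme needs no licence."))  # UK/AU-leaning
```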

Applications, risks, and defenses

  • Many see this as proof that online anonymity is fragile: alt accounts, astroturfing operations, or coordinated personas can in principle be linked.
  • Others emphasize current false positives/negatives and argue it’s far from a reliable deanonymization tool.
  • Proposed uses include detecting impersonators, bots/LLM-generated content, astroturf campaigns, or clustering ideological “styles”.
  • Suggested defenses include frequent throwaways and LLM rewriting of posts, though the latter may just create an “LLM style” fingerprint of its own.

Data access and reproducibility

  • Commenters point out that HN comment data is readily accessible via BigQuery, ClickHouse, and the official API, and they share concise SQL/ClickHouse examples for reproducing similar style vectors and nearest‑neighbor queries; a minimal API-based sketch follows.
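
For completeness, a minimal Python sketch of pulling one account’s comments through the official HN Firebase API (the slowest of the routes mentioned above, but zero setup); the username is a placeholder, and the SQL/ClickHouse variants shared in the thread are not reproduced here.

```python
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def user_comments(username: str, limit: int = 50) -> list[str]:
    """Fetch comment bodies from the first `limit` submitted items of one account."""
    user = requests.get(f"{BASE}/user/{username}.json", timeout=10).json()
    texts = []
    for item_id in (user or {}).get("submitted", [])[:limit]:
        item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()
        if item and item.get("type") == "comment" and item.get("text"):
            texts.append(item["text"])  # HTML-encoded comment body
    return texts

comments = user_comments("some_username")  # placeholder account name
print(len(comments), "comments fetched")
```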