Reproducing Hacker News writing style fingerprinting
Perceived accuracy and limitations
- Experiences are mixed: some users report the tool correctly surfacing multiple old or alt accounts (sometimes ones they had forgotten), while others find no alts among their top 20–100 matches, or mostly “random” ones.
- Effectiveness seems strongly tied to volume of text per account; rarely used throwaways or very old accounts with few comments generally don’t match well.
- Many note that matches often feel driven more by “same topic” than “same style”, especially among users who frequently discuss LLMs, Musk, self‑driving, etc.
- Similarity scores vary a lot across users (some people have many >0.85 matches, others top out around 0.75), raising questions about what “uniqueness” of style actually means here.
Methodology and technical discussion
- The system deliberately uses very common “function” words (the top ~500) rather than content words as signals, in the tradition of Burrows-style stylometry.
- The author emphasizes vector sets as a general data structure, not just a home for learned embeddings; the pipeline is cosine similarity on word-frequency vectors, with optional quantization and random projection (a minimal sketch follows this list).
- Non-mutual “nearest neighbors” (A’s closest match need not have A as its closest match) are explained via vector geometry and ranking; the tiny non-symmetries in reported scores come from int8 quantization, not from cosine similarity, which is symmetric.
- Some commenters argue BERT-like embeddings, autoencoders, dimensionality reduction, bigrams/n‑grams, or sentence-initial words could improve authorship attribution, but also risk drifting toward topic modeling.
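As a rough illustration of the pipeline described above (not the author’s actual code), the following Python sketch builds function-word frequency vectors, compares them with cosine similarity, and shows the optional int8 quantization and random-projection steps. The `FUNCTION_WORDS` list is a tiny hypothetical stand-in for the real top ~500 common-word list.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical stand-in for the real "top ~500 most common words" list.
FUNCTION_WORDS = ["the", "a", "and", "of", "to", "in", "that", "is", "it", "for"]

def style_vector(text: str) -> np.ndarray:
    """Relative frequency of each function word in an author's combined text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return np.array([counts[w] / total for w in FUNCTION_WORDS], dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity; symmetric by definition."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def quantize_int8(v: np.ndarray) -> np.ndarray:
    """Lossy int8 quantization; the discussion attributes the tiny score
    asymmetries to this step, not to cosine similarity itself."""
    scale = float(np.abs(v).max()) or 1.0
    return np.round(v / scale * 127).astype(np.int8)

def random_projection(dim_in: int, dim_out: int, seed: int = 0) -> np.ndarray:
    """Gaussian random projection matrix; roughly preserves angles at lower dimension."""
    rng = np.random.default_rng(seed)
    return (rng.normal(size=(dim_in, dim_out)) / np.sqrt(dim_out)).astype(np.float32)

if __name__ == "__main__":
    a = style_vector("The point is that it is the model and the data that matter.")
    b = style_vector("It is the data and the model that matter in the end.")
    print("cosine, float32:  ", cosine(a, b))

    qa, qb = quantize_int8(a), quantize_int8(b)
    print("cosine, int8:     ", cosine(qa.astype(np.float32), qb.astype(np.float32)))

    proj = random_projection(len(FUNCTION_WORDS), 4)
    print("cosine, projected:", cosine(a @ proj, b @ proj))
```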
Visualization, clustering, and alternatives
- Several suggest clustered or 2D visualizations (t‑SNE, MDS) and simple k‑means clustering after embedding (a small sketch follows this list).
- There’s skepticism about projecting 350D down to 2D as a faithful representation, but agreement that it would be fun/illustrative.
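For the clustering/2D ideas above, a minimal sketch using scikit-learn’s KMeans and TSNE might look like this; the 200×350 random matrix stands in for real 350-dimensional style vectors, and `n_clusters`/`perplexity` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def cluster_and_project(vectors: np.ndarray, n_clusters: int = 8, seed: int = 0):
    """k-means cluster labels plus a 2D t-SNE projection for plotting."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(vectors)
    perplexity = min(30, len(vectors) - 1)  # t-SNE requires perplexity < n_samples
    coords = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(vectors)
    return labels, coords

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_styles = rng.random((200, 350)).astype(np.float32)  # stand-in for 350-D style vectors
    labels, coords = cluster_and_project(fake_styles)
    print(coords.shape, labels[:10])  # (200, 2) and the first 10 cluster assignments
```

As the skeptics note, the 2D layout is illustrative rather than faithful; distances in the projection shouldn’t be over-interpreted.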
Language, dialect, and behavioral patterns
- Users notice matches clustering by non‑native English background, shared first language, or regional spelling (UK/AU vs US), as well as by shared autocorrect/dictionary behavior.
- Some observe that conscious style choices (avoiding “should”, “this”, or first-person pronouns) are clearly reflected in the “analyze” feature.
- There’s interest in using such signals to guess a writer’s native language or region.
Applications, risks, and defenses
- Many see this as proof that online anonymity is fragile: alt accounts, astroturf campaigns, or coordinated personas can in principle be linked.
- Others emphasize the current false-positive/false-negative rates and argue it’s far from a reliable deanonymization tool.
- Proposed uses include detecting impersonators, bots/LLM-generated content, astroturf campaigns, or clustering ideological “styles”.
- Suggested defenses include frequent throwaway accounts and rewriting posts with an LLM, though the latter may just create an “LLM style” fingerprint of its own.
Data access and reproducibility
- Commenters point out that HN comment data is trivially accessible via BigQuery, ClickHouse, and the official API, and they share concise SQL/ClickHouse examples for reproducing similar style vectors and nearest‑neighbor queries; a minimal sketch of pulling comments via the official API follows.
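The SQL/ClickHouse snippets themselves aren’t reproduced here, but as a dependency-free alternative, a minimal Python sketch against the official Firebase API (hacker-news.firebaseio.com) might look like the following. The username and limits are placeholders, and comment text comes back HTML-escaped, so it should be cleaned before computing style vectors; for bulk work, the BigQuery/ClickHouse routes are much faster than one request per item.

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def fetch(path: str):
    """GET a JSON document from the official HN API."""
    with urllib.request.urlopen(f"{API}/{path}.json", timeout=30) as resp:
        return json.load(resp)

def user_comments(username: str, limit: int = 100) -> list[str]:
    """Collect up to `limit` comment texts from a user's submitted items."""
    user = fetch(f"user/{username}") or {}
    texts = []
    for item_id in user.get("submitted", []):  # stories and comments, mixed
        item = fetch(f"item/{item_id}") or {}
        if item.get("type") == "comment" and item.get("text") and not item.get("deleted"):
            texts.append(item["text"])  # NOTE: HTML-escaped; strip tags before use
            if len(texts) >= limit:
                break
    return texts

if __name__ == "__main__":
    comments = user_comments("pg", limit=20)  # any public username works here
    print(f"fetched {len(comments)} comments")
```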