Reproducing Hacker News writing style fingerprinting
Perceived accuracy and limitations
- Experiences are mixed: some users report the tool correctly surfacing multiple old or alt accounts (sometimes ones they had forgotten), while others find no alts among their top 20–100 matches, or mostly “random” ones.
- Effectiveness seems strongly tied to volume of text per account; rarely used throwaways or very old accounts with few comments generally don’t match well.
- Many note that matches often feel driven more by “same topic” than “same style”, especially among users who frequently discuss LLMs, Musk, self‑driving, etc.
- Similarity scores vary a lot across users (some people have many >0.85 matches, others top out around 0.75), raising questions about what “uniqueness” of style actually means here.
Methodology and technical discussion
- The system deliberately uses very common “function” words (the top ~500) rather than content words as signals, in the tradition of Burrows-style stylometry.
- The author emphasizes vector sets as a general data structure, not just a home for learned embeddings; the pipeline is cosine similarity on word-frequency vectors, with optional quantization and random projection (a minimal sketch follows this list).
- Non-mutual “nearest neighbors” (A’s closest match need not have A as its closest match) are explained via vector geometry and ranking; the tiny non-symmetries in reported scores come from int8 quantization, not from cosine similarity, which is symmetric.
- Some commenters argue BERT-like embeddings, autoencoders, dimensionality reduction, bigrams/n‑grams, or sentence-initial words could improve authorship attribution, but also risk drifting toward topic modeling.
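As a rough illustration of the pipeline described above (not the author’s actual code), the following Python sketch builds function-word frequency vectors, compares them with cosine similarity, and shows the optional int8 quantization and random-projection steps. The `FUNCTION_WORDS` list is a tiny hypothetical stand-in for the real top ~500 common-word list.

```python
import re
from collections import Counter

import numpy as np

# Hypothetical stand-in for the real "top ~500 most common words" list.
FUNCTION_WORDS = ["the", "a", "and", "of", "to", "in", "that", "is", "it", "for"]

def style_vector(text: str) -> np.ndarray:
    """Relative frequency of each function word in an author's combined text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return np.array([counts[w] / total for w in FUNCTION_WORDS], dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity; symmetric by definition."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def quantize_int8(v: np.ndarray) -> np.ndarray:
    """Lossy int8 quantization; the discussion attributes the tiny score
    asymmetries to this step, not to cosine similarity itself."""
    scale = float(np.abs(v).max()) or 1.0
    return np.round(v / scale * 127).astype(np.int8)

def random_projection(dim_in: int, dim_out: int, seed: int = 0) -> np.ndarray:
    """Gaussian random projection matrix; roughly preserves angles at lower dimension."""
    rng = np.random.default_rng(seed)
    return (rng.normal(size=(dim_in, dim_out)) / np.sqrt(dim_out)).astype(np.float32)

if __name__ == "__main__":
    a = style_vector("The point is that it is the model and the data that matter.")
    b = style_vector("It is the data and the model that matter in the end.")
    print("cosine, float32:  ", cosine(a, b))

    qa, qb = quantize_int8(a), quantize_int8(b)
    print("cosine, int8:     ", cosine(qa.astype(np.float32), qb.astype(np.float32)))

    proj = random_projection(len(FUNCTION_WORDS), 4)
    print("cosine, projected:", cosine(a @ proj, b @ proj))
```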
Visualization, clustering, and alternatives
- Several suggest clustered or 2D visualizations (t‑SNE, MDS) and simple k‑means clustering after embedding (a small sketch follows this list).
- There’s skepticism about projecting 350D down to 2D as a faithful representation, but agreement that it would be fun/illustrative.
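For the clustering/2D ideas above, a minimal sketch using scikit-learn’s KMeans and TSNE might look like this; the 200×350 random matrix stands in for real 350-dimensional style vectors, and `n_clusters`/`perplexity` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def cluster_and_project(vectors: np.ndarray, n_clusters: int = 8, seed: int = 0):
    """k-means cluster labels plus a 2D t-SNE projection for plotting."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(vectors)
    perplexity = min(30, len(vectors) - 1)  # t-SNE requires perplexity < n_samples
    coords = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(vectors)
    return labels, coords

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_styles = rng.random((200, 350)).astype(np.float32)  # stand-in for 350-D style vectors
    labels, coords = cluster_and_project(fake_styles)
    print(coords.shape, labels[:10])  # (200, 2) and the first 10 cluster assignments
```

As the skeptics note, the 2D layout is illustrative rather than faithful; distances in the projection shouldn’t be over-interpreted.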
Language, dialect, and behavioral patterns
- Users notice matches clustering by non‑native English background, shared first language, or regional spelling (UK/AU vs US), as well as by shared autocorrect/dictionary behavior.
- Some observe that conscious style choices (avoiding “should”, “this”, or first-person pronouns) are clearly reflected in the “analyze” feature.
- There’s interest in using such signals to guess a writer’s native language or region.
Applications, risks, and defenses
- Many see this as proof that online anonymity is fragile: alt accounts, astroturf campaigns, or coordinated personas can in principle be linked.
- Others emphasize the current false-positive/false-negative rates and argue it’s far from a reliable deanonymization tool.
- Proposed uses include detecting impersonators, bots/LLM-generated content, astroturf campaigns, or clustering ideological “styles”.
- Suggested defenses include frequent throwaway accounts and rewriting posts with an LLM, though the latter may just create an “LLM style” fingerprint of its own.
Data access and reproducibility
- Commenters point out that HN comment data is trivially accessible via BigQuery, ClickHouse, and the official API, and they share concise SQL/ClickHouse examples for reproducing similar style vectors and nearest‑neighbor queries; a minimal sketch of pulling comments via the official API follows.
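The SQL/ClickHouse snippets themselves aren’t reproduced here, but as a dependency-free alternative, a minimal Python sketch against the official Firebase API (hacker-news.firebaseio.com) might look like the following. The username and limits are placeholders, and comment text comes back HTML-escaped, so it should be cleaned before computing style vectors; for bulk work, the BigQuery/ClickHouse routes are much faster than one request per item.

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def fetch(path: str):
    """GET a JSON document from the official HN API."""
    with urllib.request.urlopen(f"{API}/{path}.json", timeout=30) as resp:
        return json.load(resp)

def user_comments(username: str, limit: int = 100) -> list[str]:
    """Collect up to `limit` comment texts from a user's submitted items."""
    user = fetch(f"user/{username}") or {}
    texts = []
    for item_id in user.get("submitted", []):  # stories and comments, mixed
        item = fetch(f"item/{item_id}") or {}
        if item.get("type") == "comment" and item.get("text") and not item.get("deleted"):
            texts.append(item["text"])  # NOTE: HTML-escaped; strip tags before use
            if len(texts) >= limit:
                break
    return texts

if __name__ == "__main__":
    comments = user_comments("pg", limit=20)  # any public username works here
    print(f"fetched {len(comments)} comments")
```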