28M Hacker News comments as vector embedding search dataset
Permanence of HN Comments & Desire for Deletion
- Several commenters wish HN offered an option to delete accounts or comments, noting they would have written differently had they known how reusable the data would be.
- Others stress that HN has long been widely scraped; once posted, comments are effectively permanent and likely embedded in AI models and countless private archives.
- Some push back on the “carved in granite” metaphor by citing link rot, but others argue both can be true: original sources vanish while many independent copies persist.
Privacy, GDPR, and “Right to be Forgotten”
- Multiple people ask how to get their comments removed from third‑party datasets or tools built on HN data.
- GDPR is cited as giving EU users a strong legal basis to demand deletion, though enforcing this across all copies is seen as practically impossible.
- Some call HN’s “no deletion” stance a serious privacy breach and a likely GDPR violation, though the question is untested in court.
- There is skepticism that any large company truly hard-deletes data (especially from backups); others counter that the legal risk under GDPR makes willful non-deletion of EU users’ data unlikely.
Licensing, Terms of Use, and Commercial Use
- HN’s terms give Y Combinator a broad, perpetual, sublicensable license over user content.
- People question whether a third‑party dataset vendor is “affiliated” enough with Y Combinator to rely on that license, and whether commercial derivative use is allowed given HN’s stated bans on scraping and commercial exploitation.
- Debate ensues over whether embeddings are legally “derivative works” and how that differs from human memory or personal note‑taking.
- Some accept that posting on a third‑party platform inherently means ceding control via contract; others emphasize user expectations and fairness rather than strict legality.
Reactions to AI / Dataset Use
- Some feel violated or socially “betrayed” that their conversational history is now trivially searchable and used to train/benchmark models.
- Others shrug, arguing public text is inherently open to any form of processing, including AI training, and even relish their comments’ tiny influence on future models.
- A few say LLMs reduce their motivation to post helpful content since it now benefits firms they dislike more than individual humans.
Technical Details: Size, Compression, and Embedding Models
- Commenters confirm that ~55 GB of Parquet for 28M comments plus embeddings is plausible; raw text for all HN posts fits in under ~20 GB uncompressed and single‑digit GB compressed (a rough arithmetic check follows this list).
- Several note how little storage text actually needs and discuss text as lossy “concept compression.”
- There’s interest in concrete hardware and costs: one similar HN embedding project reports daily updates running on a MacBook and a historic backfill on a cheap rented GPU, with hosting costs dominated by RAM for HNSW indexes (see the retrieval sketch after this list).
- Advanced users criticize the choice of all‑MiniLM‑L6‑v2 as outdated and recommend newer open‑weights embedding models (e.g., EmbeddingGemma, BGE, Qwen embeddings, nomic-embed-text), with trade‑offs around size, speed, context length, and licensing.
- Others are focused on lightweight client‑side models (<100 MB) and share candidate models and leaderboards for comparison.
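To sanity-check the size figures above: a rough back-of-envelope sketch, assuming 384‑dimensional float32 vectors (the output size of all‑MiniLM‑L6‑v2) and an illustrative ~200 bytes of text per comment; neither number is stated in the post.

```python
# Back-of-envelope dataset size, under the stated assumptions.
n_comments = 28_000_000
dim = 384                       # all-MiniLM-L6-v2 embedding dimension
bytes_per_vector = dim * 4      # float32

embedding_bytes = n_comments * bytes_per_vector
text_bytes = n_comments * 200   # hypothetical average comment length

print(f"embeddings: {embedding_bytes / 1e9:.1f} GB")  # ~43.0 GB
print(f"raw text:   {text_bytes / 1e9:.1f} GB")       # ~5.6 GB
```

Roughly 43 GB of vectors plus a few GB of text lands near the ~55 GB Parquet figure before column compression, and shows that the vectors, not the text, dominate the footprint.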
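To make the RAM point concrete, here is a minimal retrieval sketch using the sentence-transformers and hnswlib libraries; the model name comes from the thread, while the corpus, parameters, and library choice are illustrative assumptions.

```python
import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical stand-in corpus; the real dataset has 28M comments.
comments = [
    "Rust's borrow checker is worth the learning curve.",
    "SQLite is underrated for production workloads.",
    "Remember when everyone said XML would last forever?",
]
vectors = model.encode(comments, normalize_embeddings=True)  # shape (3, 384)

# HNSW keeps every vector plus its graph links in RAM, which is why
# hosting cost is dominated by memory rather than CPU.
index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(comments), ef_construction=200, M=16)
index.add_items(vectors, list(range(len(comments))))

query_vec = model.encode(["is sqlite good enough for real apps?"],
                         normalize_embeddings=True)
labels, distances = index.knn_query(query_vec, k=2)
for label, dist in zip(labels[0], distances[0]):
    print(f"similarity {1 - dist:.3f}: {comments[label]}")  # cosine distance -> similarity
```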
Search, Semantics, and Potential Applications
- Some ask how vector search compares with “normal” text search; BM25 is cited as the standard baseline in retrieval papers (a side‑by‑side sketch closes this section).
- Ideas are floated for UI features like “find similar sentences” and semantic threading of discussions to reveal when the same debate has occurred before.
- Prior work using stylometry to link alternate accounts from HN writing style is mentioned as a cautionary example of how analyzable the corpus already is.
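As a rough illustration of that comparison, the sketch below scores one paraphrased query with BM25 (via the rank_bm25 library) and with cosine similarity over embeddings; the corpus, query, and library choice are assumptions for demonstration, not from the thread.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "GDPR gives EU users a right to erasure.",
    "You can request deletion of your personal data in Europe.",
    "HNSW indexes trade RAM for query speed.",
]
query = "how do i get my data deleted under eu law"

# BM25 rewards exact token overlap only ("eu", "data"); the paraphrase
# "right to erasure" contributes nothing without stemming or synonyms.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
print("BM25 scores:  ", bm25.get_scores(query.split()))

# Embeddings rank the paraphrase highly despite little word overlap.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]
print("cosine scores:", doc_vecs @ query_vec)
```

This is the trade-off behind the BM25-as-baseline remark: lexical search is cheap and precise on exact terms, while vector search recovers paraphrases at the cost of a model and an index.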