28M Hacker News comments as vector embedding search dataset
Permanence of HN Comments & Desire for Deletion
- Several commenters wish HN offered an option to delete accounts or comments, noting they would have written differently had they known how reusable the data would be.
- Others stress that HN has long been widely scraped; once posted, comments are effectively permanent and likely embedded in AI models and countless private archives.
- Some push back on the “carved in granite” metaphor by citing link rot, but others argue both can be true: original sources vanish while many independent copies persist.
Privacy, GDPR, and “Right to be Forgotten”
- Multiple people ask how to get their comments removed from third‑party datasets or tools built on HN data.
- GDPR is cited as giving EU users a strong legal basis to demand deletion, though enforcing this across all copies is seen as practically impossible.
- Some call HN’s “no deletion” stance a serious privacy breach and a likely GDPR violation, though the question is untested in court.
- There is skepticism that any large company truly hard-deletes data (especially from backups); others counter that the legal risk under GDPR makes willful non-deletion of EU users’ data unlikely.
Licensing, Terms of Use, and Commercial Use
- HN’s terms give Y Combinator a broad, perpetual, sublicensable license over user content.
- People question whether a third‑party dataset vendor is “affiliated” enough with Y Combinator to rely on that license, and whether commercial derivative use is allowed given HN’s stated bans on scraping and commercial exploitation.
- Debate ensues over whether embeddings are legally “derivative works” and how that differs from human memory or personal note‑taking.
- Some accept that posting on a third‑party platform inherently means ceding control via contract; others emphasize user expectations and fairness rather than strict legality.
Reactions to AI / Dataset Use
- Some feel violated or socially “betrayed” that their conversational history is now trivially searchable and used to train/benchmark models.
- Others shrug, arguing public text is inherently open to any form of processing, including AI training, and even relish their comments’ tiny influence on future models.
- A few say LLMs reduce their motivation to post helpful content since it now benefits firms they dislike more than individual humans.
Technical Details: Size, Compression, and Embedding Models
- Commenters confirm that ~55 GB of Parquet for 28M comments plus embeddings is plausible; raw text for all HN posts fits in under ~20 GB uncompressed and single‑digit GB compressed (a rough arithmetic check follows this list).
- Several note how little storage text actually needs and discuss text as lossy “concept compression.”
- There’s interest in concrete hardware and costs: one similar HN embedding project reports daily updates running on a MacBook and a historic backfill on a cheap rented GPU, with hosting costs dominated by RAM for HNSW indexes (see the retrieval sketch after this list).
- Advanced users criticize the choice of all‑MiniLM‑L6‑v2 as outdated and recommend newer open‑weights embedding models (e.g., EmbeddingGemma, BGE, Qwen embeddings, nomic-embed-text), with trade‑offs around size, speed, context length, and licensing.
- Others are focused on lightweight client‑side models (<100 MB) and share candidate models and leaderboards for comparison.
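To sanity-check the size figures above: a rough back-of-envelope sketch, assuming 384‑dimensional float32 vectors (the output size of all‑MiniLM‑L6‑v2) and an illustrative ~200 bytes of text per comment; neither number is stated in the post.

```python
# Back-of-envelope dataset size, under the stated assumptions.
n_comments = 28_000_000
dim = 384                       # all-MiniLM-L6-v2 embedding dimension
bytes_per_vector = dim * 4      # float32

embedding_bytes = n_comments * bytes_per_vector
text_bytes = n_comments * 200   # hypothetical average comment length

print(f"embeddings: {embedding_bytes / 1e9:.1f} GB")  # ~43.0 GB
print(f"raw text:   {text_bytes / 1e9:.1f} GB")       # ~5.6 GB
```

Roughly 43 GB of vectors plus a few GB of text lands near the ~55 GB Parquet figure before column compression, and shows that the vectors, not the text, dominate the footprint.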
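To make the RAM point concrete, here is a minimal retrieval sketch using the sentence-transformers and hnswlib libraries; the model name comes from the thread, while the corpus, parameters, and library choice are illustrative assumptions.

```python
import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical stand-in corpus; the real dataset has 28M comments.
comments = [
    "Rust's borrow checker is worth the learning curve.",
    "SQLite is underrated for production workloads.",
    "Remember when everyone said XML would last forever?",
]
vectors = model.encode(comments, normalize_embeddings=True)  # shape (3, 384)

# HNSW keeps every vector plus its graph links in RAM, which is why
# hosting cost is dominated by memory rather than CPU.
index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(comments), ef_construction=200, M=16)
index.add_items(vectors, list(range(len(comments))))

query_vec = model.encode(["is sqlite good enough for real apps?"],
                         normalize_embeddings=True)
labels, distances = index.knn_query(query_vec, k=2)
for label, dist in zip(labels[0], distances[0]):
    print(f"similarity {1 - dist:.3f}: {comments[label]}")  # cosine distance -> similarity
```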
Search, Semantics, and Potential Applications
- Some ask how vector search compares with “normal” text search; BM25 is cited as the standard baseline in retrieval papers (a side‑by‑side sketch closes this section).
- Ideas are floated for UI features like “find similar sentences” and semantic threading of discussions to reveal when the same debate has occurred before.
- Prior work using stylometry to link alternate accounts from HN writing style is mentioned as a cautionary example of how analyzable the corpus already is.
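As a rough illustration of that comparison, the sketch below scores one paraphrased query with BM25 (via the rank_bm25 library) and with cosine similarity over embeddings; the corpus, query, and library choice are assumptions for demonstration, not from the thread.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "GDPR gives EU users a right to erasure.",
    "You can request deletion of your personal data in Europe.",
    "HNSW indexes trade RAM for query speed.",
]
query = "how do i get my data deleted under eu law"

# BM25 rewards exact token overlap only ("eu", "data"); the paraphrase
# "right to erasure" contributes nothing without stemming or synonyms.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
print("BM25 scores:  ", bm25.get_scores(query.split()))

# Embeddings rank the paraphrase highly despite little word overlap.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]
print("cosine scores:", doc_vecs @ query_vec)
```

This is the trade-off behind the BM25-as-baseline remark: lexical search is cheap and precise on exact terms, while vector search recovers paraphrases at the cost of a model and an index.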