28M Hacker News comments as vector embedding search dataset

Permanence of HN Comments & Desire for Deletion

  • Several commenters wish there were an account/comment delete option and note they would have written differently had they known how reusable the data would be.
  • Others stress that HN has long been widely scraped; once posted, comments are effectively permanent and likely embedded in AI models and countless private archives.
  • Some push back on the “carved in granite” metaphor by citing link rot, but others argue both can be true: original sources vanish while many independent copies persist.

Privacy, GDPR, and “Right to be Forgotten”

  • Multiple people ask how to get their comments removed from third‑party datasets or tools built on HN data.
  • GDPR is cited as giving EU users a strong legal basis to demand deletion, though enforcing this across all copies is seen as practically impossible.
  • Some call HN’s “no deletion” stance a serious privacy breach and a likely GDPR violation, though this remains untested in court.
  • There is skepticism that any large company truly hard-deletes data (especially from backups); others counter that GDPR exposure makes willful non‑deletion of EU users’ data unlikely.

Licensing, Terms of Use, and Commercial Use

  • HN’s terms give Y Combinator a broad, perpetual, sublicensable license over user content.
  • People question whether a third‑party dataset vendor is “affiliated” enough to rely on that license, and whether commercial derivative use is allowed given HN’s stated bans on scraping and commercial exploitation.
  • Debate ensues over whether embeddings are legally “derivative works” and how that differs from human memory or personal note‑taking.
  • Some accept that posting on a third‑party platform inherently means ceding control via contract; others emphasize user expectations and fairness rather than strict legality.

Reactions to AI / Dataset Use

  • Some feel violated or socially “betrayed” that their conversational history is now trivially searchable and used to train/benchmark models.
  • Others shrug, arguing public text is inherently open to any form of processing, including AI training, and even relish their comments’ tiny influence on future models.
  • A few say LLMs reduce their motivation to post helpful content, since doing so now benefits firms they dislike more than it helps individual humans.

Technical Details: Size, Compression, and Embedding Models

  • Commenters confirm that ~55 GB of Parquet for 28M comments plus embeddings is plausible; raw text for all HN posts can be under ~20 GB uncompressed and single‑digit GB compressed (see the back‑of‑the‑envelope sketch after this list).
  • Several note how little storage text actually needs and discuss text as lossy “concept compression.”
  • There’s interest in the concrete hardware and costs: one similar HN embedding project reports daily updates running on a MacBook and historic backfill on a cheap rented GPU, with hosting costs dominated by RAM for HNSW indexes (see the index sketch below).
  • Advanced users criticize the choice of all‑MiniLM‑L6‑v2 as outdated and recommend newer open‑weights embedding models (e.g., EmbeddingGemma, BGE, Qwen embeddings, nomic-embed-text), with trade‑offs around size, speed, context length, and licensing (see the model‑swap sketch below).
  • Others are focused on lightweight client‑side models (<100 MB) and share candidate models and leaderboards for comparison.
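
One way to sanity-check the ~55 GB figure is simple arithmetic. The Python sketch below assumes 384‑dimensional float32 vectors (which is what all‑MiniLM‑L6‑v2 produces) and an illustrative ~500 bytes of text per comment:

    # Back-of-the-envelope storage estimate. The 384-dim float32 figure
    # matches all-MiniLM-L6-v2's output; 500 B/comment is an assumption.
    n_comments = 28_000_000
    dims = 384
    bytes_per_float = 4  # float32

    embedding_gb = n_comments * dims * bytes_per_float / 1e9
    print(f"embeddings alone: {embedding_gb:.0f} GB")  # ~43 GB

    avg_comment_bytes = 500  # illustrative average
    text_gb = n_comments * avg_comment_bytes / 1e9
    print(f"raw text estimate: {text_gb:.0f} GB")  # ~14 GB

Forty‑odd GB of raw vectors plus a dozen GB of text lands close to the reported ~55 GB, especially since Parquet compresses text well but shrinks high‑entropy float columns very little.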
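
On the RAM point raised above: HNSW keeps both the vectors and their graph links resident in memory, so 28M 384‑dim float32 vectors already imply 43+ GB of RAM before link overhead. A minimal sketch using the hnswlib library (parameter values are illustrative, not the project’s actual configuration):

    import hnswlib
    import numpy as np

    dims, n = 384, 100_000  # small demo index; RAM scales with n

    # M sets links per element: higher M improves recall but adds
    # per-element graph memory on top of the raw vectors.
    index = hnswlib.Index(space="cosine", dim=dims)
    index.init_index(max_elements=n, ef_construction=200, M=16)

    vectors = np.random.rand(n, dims).astype(np.float32)
    index.add_items(vectors, np.arange(n))

    index.set_ef(50)  # query-time recall/speed trade-off
    labels, distances = index.knn_query(vectors[:1], k=5)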
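
Because sentence-transformers resolves models by name, trying a newer open‑weights model against the criticized default is nearly a one‑line change (a sketch; the alternative model string is just one example from the families named above):

    from sentence_transformers import SentenceTransformer

    # The dated default under discussion: small, fast, 384-dim output.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Newer alternatives are drop-in swaps by name, e.g.:
    # model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    emb = model.encode(["28M Hacker News comments as a search dataset"])
    print(emb.shape)  # (1, 384) for MiniLM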

Search, Semantics, and Potential Applications

  • Some ask how vector search compares with “normal” text search; BM25 is cited as the standard baseline in retrieval papers (see the sketch after this list).
  • Ideas are floated for UI features like “find similar sentences” and semantic threading of discussions to reveal when the same debate has occurred before.
  • Prior work using stylometry to link alternate accounts from HN writing style is mentioned as a cautionary example of how analyzable the corpus already is.
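
To make the BM25-versus-vector contrast concrete, here is a hedged sketch pairing the rank_bm25 package with sentence-transformers; the three-document corpus and the query are invented for illustration:

    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    corpus = [
        "GDPR gives EU users a right to erasure.",
        "HNSW indexes keep all vectors in RAM.",
        "Link rot means original sources often vanish.",
    ]
    query = "deleting my data under European privacy law"

    # BM25 scores by exact term overlap; this paraphrased query shares
    # no tokens with any document, so every score comes back zero.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    print(bm25.get_scores(query.lower().split()))

    # Embeddings score by meaning, so the GDPR document should rank first.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(corpus, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    print(util.cos_sim(query_emb, doc_emb))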