Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun

Dataset access & distribution

  • Many want regular public dumps (zip/torrent/IPFS/HuggingFace) so projects don’t need to re-scrape HN.
  • Existing options mentioned: an outdated BigQuery dataset, a private monthly CSV dump, ClickHouse scripts that pull directly from the HN API, and this project’s Apache Arrow releases.
  • Some are unsure about legal/licensing status of republishing HN data.
  • Technical notes include handling HTTP timeouts and retries when bulk-fetching via the API.
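
The timeout-and-retry handling mentioned above could look roughly like the following. This is a minimal sketch, not the approach any commenter actually described: it assumes the public HN Firebase item endpoint, and the helper names (`http_get_json`, `fetch_item`), the attempt count, and the backoff delays are all illustrative choices.

```python
import json
import socket
import time
import urllib.error
import urllib.request

# Public HN API item endpoint (Firebase-hosted).
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"

# Errors treated as transient in this sketch. HTTPError is a subclass of
# URLError, so HTTP error statuses are retried as well.
TRANSIENT_ERRORS = (urllib.error.URLError, socket.timeout, TimeoutError, ConnectionError)


def http_get_json(url, timeout=10.0):
    """Fetch a URL and decode its JSON body; raises on timeouts and HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def fetch_item(item_id, get=http_get_json, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Fetch one HN item, retrying transient failures with exponential backoff.

    `get` and `sleep` are injectable so the retry logic can be exercised
    without touching the network. All parameter defaults are assumptions.
    """
    url = HN_ITEM_URL.format(id=item_id)
    for attempt in range(1, max_attempts + 1):
        try:
            return get(url)
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A bulk fetch would call `fetch_item` over a range of IDs, ideally with a bounded worker pool so the retries do not pile up against the API.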

Technical approaches & tools

  • Commenters compare embeddings plus UMAP + HDBSCAN against cheaper traditional methods (bag‑of‑words, topic models, graph analysis). Some argue the simpler methods might be “good enough” for 2D maps.
  • Suggestions to use GPU-accelerated HDBSCAN/UMAP (cuML/RAPIDS) and vector search tooling (CAGRA/cuVS, Lucene‑cuVS).
  • Parametric UMAP is noted as attractive but currently limited by GPU memory in common implementations.
  • Debate around models: BERT/FlagEmbedding vs LLMs, and whether classical ML (SVM, XGBoost, LSTM on BERT outputs) often suffices.
  • Cross‑encoders are proposed as a higher‑quality second pass for similarity search.
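
The "cheap first pass, higher-quality second pass" idea from the last bullet can be sketched as below. This is an illustration only: the first pass uses bag-of-words cosine similarity as a stand-in for whatever retrieval the project uses, and `toy_rerank_score` is a hypothetical placeholder for a real cross-encoder, which would score each (query, document) pair jointly with a model.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_rerank_score(query, doc):
    """Hypothetical stand-in for a cross-encoder: fraction of query terms found in doc."""
    q_terms = set(bow(query))
    return len(q_terms & set(bow(doc))) / len(q_terms) if q_terms else 0.0


def search(query, docs, rerank_score=toy_rerank_score, k_first=10, k_final=3):
    """Two-stage search: cheap recall pass, then a costlier re-ranking pass."""
    q = bow(query)
    # First pass: rank every document with the cheap similarity, keep the top k_first.
    candidates = sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k_first]
    # Second pass: re-score only the shortlisted candidates with the expensive scorer.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:k_final]
```

The point of the split is cost: the expensive scorer runs on `k_first` candidates rather than on the whole corpus, so `k_first` trades recall against second-pass compute.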

Sentiment, negativity, and toxicity on HN

  • The project finds overall negative sentiment across topics.
  • Some readers say this matches their impression of HN as cynical or toxic; others strongly disagree, seeing HN as relatively civil and “truth‑seeking but critical.”
  • Several point out that:
    • Comments are naturally more critical than votes.
    • Direct, technical criticism may be misread by generic sentiment models as “negative.”
    • A more HN‑specific sentiment or “constructiveness/usefulness” classifier might be more meaningful.
  • There is interest in comparing sentiment by topic, time, and platform, and in weighting it by upvotes (which are not available in the API).
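
If comment scores were obtainable from some other source, upvote weighting could be a per-topic weighted mean, as sketched here. The row layout, the function name, and the `1 + upvotes` weight (so unvoted comments still count and negative scores do not flip the sign) are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_sentiment_by_topic(rows):
    """Upvote-weighted mean sentiment per topic.

    `rows` is an iterable of (topic, sentiment, upvotes) tuples, where
    sentiment is a float (for example in [-1, 1]). Hypothetical input format.
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for topic, sentiment, upvotes in rows:
        weight = 1 + max(upvotes, 0)  # assumed weighting scheme
        weighted_sum[topic] += weight * sentiment
        total_weight[topic] += weight
    return {topic: weighted_sum[topic] / total_weight[topic] for topic in weighted_sum}


rows = [
    ("rust", -1.0, 9),  # heavily upvoted negative comment
    ("rust", 1.0, 0),   # unvoted positive comment
    ("ai", 0.5, 3),
]
by_topic = weighted_sentiment_by_topic(rows)
```

Here the single upvoted negative comment dominates the "rust" average, which is exactly the effect weighting is meant to capture and also its main risk.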

Visualization, UX, and feature ideas

  • Users praise the canvas map but request: more meaningful zooming, better cluster labeling, variable text size by importance, clearer coloring, mobile fixes, and clickable links to HN items.
  • Suggestions include showing topic/popularity on the map, shareable search URLs (since implemented), subscriptions/digests by topic, and yearly/evolution views of the embedding.

Scale, cost, and meta-discussion

  • Commenters are impressed that a hobby project used ~150 GPUs and custom infra; some question whether that scale was necessary.
  • Reported compute cost is “hundreds of dollars,” seen as modest by some and high for a hobby by others.
  • Brief side debates touch on GDPR applicability to processing HN data and on rising “self‑promotion” norms in Show HN titles.