Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun
Dataset access & distribution
- Many want regular public dumps (zip/torrent/IPFS/HuggingFace) so projects don’t need to re-scrape HN.
- Existing options mentioned: an outdated BigQuery dataset, a private monthly CSV dump, ClickHouse scripts that pull directly from the HN API, and this project’s Apache Arrow releases.
- Some are unsure about legal/licensing status of republishing HN data.
- Technical notes include handling HTTP timeouts and retries when bulk-fetching via the API.
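The timeout-and-retry handling mentioned in the last note could be sketched roughly as below. This is a minimal illustration using only the standard library, not the approach taken by any script in the thread: `fetch_item`, `http_get_json`, and the parameters `max_attempts` and `base_delay` are hypothetical names, and the backoff values are arbitrary. The URL is the public HN Firebase item endpoint.

```python
import json
import time
import urllib.error
import urllib.request

# Public HN Firebase API item endpoint.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def http_get_json(url, timeout=10.0):
    """Fetch a URL and decode its JSON body; raises on timeout or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_item(item_id, get=http_get_json, max_attempts=5,
               base_delay=0.5, sleep=time.sleep):
    """Fetch one item, retrying failed requests with exponential backoff.

    `get` and `sleep` are injectable so the retry logic can be exercised
    without touching the network (an assumption of this sketch).
    """
    url = ITEM_URL.format(id=item_id)
    for attempt in range(1, max_attempts + 1):
        try:
            return get(url)
        except (TimeoutError, urllib.error.URLError, ConnectionError):
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Wait 0.5s, 1s, 2s, ... before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Note that `urllib.error.HTTPError` is a subclass of `URLError`, so this sketch also retries HTTP error responses; a bulk fetcher would probably want to distinguish permanent failures from transient ones and cap the total wait.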
Technical approaches & tools
- Discussion of using embeddings plus UMAP + HDBSCAN vs cheaper traditional methods (bag‑of‑words, topic models, graph analysis). Some argue simpler methods might be “good enough” for 2D maps.
- Suggestions to use GPU-accelerated HDBSCAN/UMAP (cuML/RAPIDS) and vector search tooling (CAGRA/cuVS, Lucene‑cuVS).
- Parametric UMAP is noted as attractive but currently limited by GPU memory in common implementations.
- Debate around models: BERT/FlagEmbedding vs LLMs, and whether classical ML (SVM, XGBoost, LSTM on BERT outputs) often suffices.
- Cross‑encoders are proposed as a higher‑quality second pass for similarity search.
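The "cheap first pass, higher-quality second pass" idea from the bullets above can be illustrated with a small two-stage search. This is a sketch under stated assumptions, not the project's pipeline: the first pass uses plain bag-of-words cosine similarity in pure Python as a stand-in for any inexpensive retriever, and `rerank_score` is a hypothetical callable that stands in for a cross-encoder scoring each (query, document) pair.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, docs, rerank_score, k=3):
    """Return indices of the top-k documents, reranked by `rerank_score`."""
    q = Counter(tokenize(query))
    # First pass: cheap bag-of-words cosine over every document.
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q, Counter(tokenize(docs[i]))),
                    reverse=True)
    candidates = ranked[:k]
    # Second pass: the expensive pairwise scorer sees only k candidates.
    return sorted(candidates,
                  key=lambda i: rerank_score(query, docs[i]),
                  reverse=True)
```

The design point is that the expensive scorer runs on `k` pairs rather than on the whole corpus, which is why a cross-encoder is only practical as a second pass.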
Sentiment, negativity, and toxicity on HN
- The project finds overall negative sentiment across topics.
- Some readers say this matches their impression of HN as cynical or toxic; others strongly disagree, seeing HN as relatively civil and “truth‑seeking but critical.”
- Several point out that:
  - Comments are naturally more critical than votes.
  - Direct, technical criticism may be misread by generic sentiment models as “negative.”
  - A more HN‑specific sentiment or “constructiveness/usefulness” classifier might be more meaningful.
- There is interest in comparing sentiment by topic, time, and platform, and in weighting sentiment by upvotes (comment scores are not available in the API).
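If per-comment vote counts were obtainable from some other source, upvote weighting could be as simple as a weighted mean per topic. The sketch below is hypothetical throughout: the `(topic, sentiment, upvotes)` row shape, the sentiment range of -1 to 1, and the `+1` smoothing are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_sentiment(rows):
    """Upvote-weighted mean sentiment per topic.

    rows: iterable of (topic, sentiment, upvotes) tuples, with sentiment
    assumed to lie in [-1, 1].
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for topic, sentiment, upvotes in rows:
        # +1 so zero-vote comments still count; negative scores are clamped.
        weight = max(upvotes, 0) + 1
        num[topic] += weight * sentiment
        den[topic] += weight
    return {topic: num[topic] / den[topic] for topic in num}
```

For example, one comment scored -0.5 with 0 upvotes and one scored 0.5 with 2 upvotes get weights 1 and 3, giving a weighted mean of 0.25 where the unweighted mean would be 0.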
Visualization, UX, and feature ideas
- Users praise the canvas map but request: more meaningful zooming, better cluster labeling, variable text size by importance, clearer coloring, mobile fixes, and clickable links to HN items.
- Suggestions include showing topic/popularity on the map, shareable search URLs (since implemented), subscriptions/digests by topic, and yearly/evolution views of the embedding.
Scale, cost, and meta-discussion
- Commenters are impressed that a hobby project used ~150 GPUs and custom infra; some question whether that scale was necessary.
- Reported compute cost is “hundreds of dollars,” seen as modest by some and high for a hobby by others.
- Brief side debates touch on GDPR applicability to processing HN data and on rising “self‑promotion” norms in Show HN titles.