2024-05-09

Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun

Dataset access & distribution

Many commenters want regular public dumps (zip/torrent/IPFS/HuggingFace) so projects don’t need to re-scrape HN.
Existing options mentioned: an outdated BigQuery dataset, a private monthly CSV dump, ClickHouse scripts that pull directly from the HN API, and this project’s Apache Arrow releases.
Some are unsure about the legal/licensing status of republishing HN data.
Technical notes include handling HTTP timeouts and retries when bulk-fetching via the API.
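
As a rough illustration of the timeout-and-retry handling mentioned above, here is a minimal Python sketch using only the standard library. The function names (fetch_json, fetch_with_retries), the retry count, and the backoff constants are assumptions made for this example, not details reported in the thread; the item URL pattern is the public HN Firebase API endpoint.

```python
import json
import random
import time
import urllib.request

# Public HN Firebase API endpoint for a single item.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_json(url, timeout=10.0):
    """Fetch one URL and decode its JSON body; raises on timeout or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_with_retries(url, fetch=fetch_json, max_attempts=5,
                       base_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying failures with exponential backoff plus jitter.

    max_attempts and base_delay are illustrative defaults, not tuned values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:
            # URLError, socket timeouts, and connection resets all derive
            # from OSError, so this covers the usual transient failures.
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A bulk fetcher would call fetch_with_retries(ITEM_URL.format(id=item_id)) for each item ID. The fetch and sleep parameters are injectable mainly so the retry logic can be exercised without touching the network.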

Technical approaches & tools

Commenters compare embeddings plus UMAP + HDBSCAN against cheaper traditional methods (bag‑of‑words, topic models, graph analysis); some argue the simpler methods might be “good enough” for 2D maps.
Suggestions include GPU-accelerated HDBSCAN/UMAP (cuML/RAPIDS) and vector search tooling (CAGRA/cuVS, Lucene‑cuVS).
Parametric UMAP is noted as attractive but currently limited by GPU memory in common implementations.
Debate around models: BERT/FlagEmbedding vs LLMs, and whether classical ML (SVM, XGBoost, LSTM on BERT outputs) often suffices.
Cross‑encoders are proposed as a higher‑quality second pass for similarity search.
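
The two-pass idea can be sketched as follows: a cheap vector similarity shortlists candidates, and a slower pairwise scorer reorders only that shortlist. This is a minimal standard-library illustration, not the project's code. The names two_stage_search and pair_score are hypothetical, and the word-overlap scorer is only a stand-in for a real cross-encoder model, which would score each (query, document) pair jointly.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(query_text, query_vec, docs, pair_score,
                     first_k=50, final_k=10):
    """docs is a list of (text, vector) pairs.

    pair_score(query_text, doc_text) returns a float, higher meaning more
    relevant; in practice this is where a cross-encoder would be called.
    """
    # Pass 1: cheap embedding similarity over the whole corpus.
    shortlist = sorted(docs, key=lambda d: cosine(query_vec, d[1]),
                       reverse=True)[:first_k]
    # Pass 2: expensive pairwise scoring, applied to the shortlist only.
    reranked = sorted(shortlist, key=lambda d: pair_score(query_text, d[0]),
                      reverse=True)
    return [text for text, _ in reranked[:final_k]]


def word_overlap(query_text, doc_text):
    """Stand-in scorer (Jaccard overlap of words); NOT a real cross-encoder."""
    q, d = set(query_text.lower().split()), set(doc_text.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0
```

The design point is cost: the pairwise scorer runs first_k times per query instead of once per document in the corpus, which is what makes a heavier model affordable as a second pass.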

Sentiment, negativity, and toxicity on HN

The project finds overall negative sentiment across topics.
Some readers say this matches their impression of HN as cynical or toxic; others strongly disagree, seeing HN as relatively civil and “truth‑seeking but critical.”
Several point out that:
- Comments are naturally more critical than votes.
- Direct, technical criticism may be misread by generic sentiment models as “negative.”
- A more HN‑specific sentiment or “constructiveness/usefulness” classifier might be more meaningful.
There is interest in comparing sentiment by topic, time, platform, and weighting by upvotes (not available in the API).
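
For illustration, weighting sentiment by upvotes amounts to a weighted mean per topic. The sketch below assumes vote counts were obtained from some other source, since the thread notes the API does not provide them; the function name, the (topic, sentiment, upvotes) row layout, and the +1 smoothing are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_sentiment_by_topic(rows):
    """Return {topic: upvote-weighted mean sentiment}.

    rows is an iterable of (topic, sentiment, upvotes) tuples, where
    sentiment is a score such as -1.0 (negative) to 1.0 (positive).
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # topic -> [weighted sum, weight]
    for topic, sentiment, upvotes in rows:
        # +1 so that zero-vote comments still contribute (an assumption).
        weight = max(upvotes, 0) + 1
        totals[topic][0] += sentiment * weight
        totals[topic][1] += weight
    return {topic: s / w for topic, (s, w) in totals.items()}
```

With this scheme a single highly upvoted positive comment can outweigh several unvoted negative ones, which is the effect commenters were asking about.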

Visualization, UX, and feature ideas

Users praise the canvas map but request: more meaningful zooming, better cluster labeling, variable text size by importance, clearer coloring, mobile fixes, and clickable links to HN items.
Suggestions include showing topic/popularity on the map, shareable search URLs (since implemented), subscriptions/digests by topic, and yearly/evolution views of the embedding.

Scale, cost, and meta-discussion

Commenters are impressed that a hobby project used ~150 GPUs and custom infra; some question whether that scale was necessary.
Reported compute cost is “hundreds of dollars,” seen as modest by some and high for a hobby by others.
Brief side debates touch on GDPR applicability to processing HN data and on rising “self‑promotion” norms in Show HN titles.
