You Wouldn't Download a Hacker News
Downloading and Accessing HN Data
- Several commenters have independently downloaded large portions (or all) of HN, using:
  - The official Firebase API (often via custom clients); a minimal fetch loop is sketched after this list.
  - Public datasets in BigQuery and ClickHouse, which avoid heavy API traffic and allow browser-side SQL queries.
  - BigQuery → Parquet → DuckDB workflows for local analytics.
- There’s debate on “netiquette,” but consensus is that:
  - The API is explicitly provided and not rate-limited.
  - Public mirrors (e.g., on Google Cloud) exist.
  - Scraping public pages “slowly and nicely” is considered acceptable.
- Some suggest torrents or ZIM archives for offline / archival use, especially for a hypothetical post‑apocalyptic reading list.
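The endpoints below are the documented public Firebase ones (`/v0/maxitem.json` and `/v0/item/<id>.json`); the output file name, batch size, and pacing are illustrative choices rather than anything the API requires. A minimal sketch of pulling recent items:

```python
import json
import time

import requests

BASE = "https://hacker-news.firebaseio.com/v0"  # official, unauthenticated API


def fetch_item(item_id: int) -> dict | None:
    """Fetch a single story/comment/poll as JSON; deleted ids return null."""
    resp = requests.get(f"{BASE}/item/{item_id}.json", timeout=10)
    resp.raise_for_status()
    return resp.json()


def main() -> None:
    max_id = requests.get(f"{BASE}/maxitem.json", timeout=10).json()
    # Walk backwards from the newest item. The short pause is politeness,
    # not an API requirement (the API is not rate-limited).
    with open("hn.jsonl", "a", encoding="utf-8") as out:
        for item_id in range(max_id, max_id - 1000, -1):
            item = fetch_item(item_id)
            if item is not None:
                out.write(json.dumps(item) + "\n")
            time.sleep(0.1)


if __name__ == "__main__":
    main()
```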
Analyzing Topics and Language Trends
- People critique the article’s language-frequency analysis:
  - Simple substring queries (e.g., “Java”, “Rust”, “JS”, “R”) pick up many false positives: “Rust” matches inside “trust” and “antitrust”, “JS” inside “JSON”, “Java” inside “JavaScript”. See the word-boundary comparison after this list.
  - This makes pre‑release Rust “popularity” and some other trends suspect.
- Jokes arise about naming languages “Go” or “A” to game poorly-designed popularity metrics.
- Stacked area charts are widely criticized as misleading and hard to read, especially on log scales:
  - Readers struggle to see true relative shares and are confused about what the y‑axis represents.
  - Suggestions: non‑stacked line charts, small multiples, or aligned separate plots (see the small-multiples sketch below).
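To make the false-positive point concrete, here is a sketch contrasting the naive substring match the thread criticizes with a word-boundary regex; the sample comments are invented for illustration:

```python
import re

comments = [
    "I trust the Rust compiler for systems work.",
    "Parsing JSON is easy in most languages.",
    "The antitrust case dragged on for years.",
    "JS frameworks come and go.",
]

# Naive case-insensitive substring test, the style the thread criticizes:
# "rust" matches inside "trust"/"antitrust", and "js" inside "JSON".
substring_hits = [c for c in comments
                  if "rust" in c.lower() or "js" in c.lower()]

# Word-boundary regex: only whole tokens count, which removes those false
# positives (one-letter names like "R" and words like "Go" stay hard).
pattern = re.compile(r"\b(rust|js)\b", re.IGNORECASE)
regex_hits = [c for c in comments if pattern.search(c)]

print(len(substring_hits))  # 4 -- every sample "mentions" a language
print(len(regex_hits))      # 2 -- only the genuine Rust and JS mentions
```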
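And a small-multiples sketch along the lines readers suggested, using matplotlib; the counts are made-up placeholders, and the real series would come from the dump:

```python
import matplotlib.pyplot as plt

# Placeholder yearly mention counts (hypothetical numbers).
years = [2019, 2020, 2021, 2022, 2023]
mentions = {
    "Python": [900, 1000, 1100, 1150, 1200],
    "Rust":   [200, 320, 480, 650, 800],
    "Go":     [400, 430, 450, 470, 460],
}

# One aligned panel per language with shared axes: each trend is readable
# on its own, unlike a stacked area chart where a series' apparent height
# depends on everything drawn beneath it.
fig, axes = plt.subplots(len(mentions), 1, sharex=True, sharey=True,
                         figsize=(6, 6))
for ax, (lang, counts) in zip(axes, mentions.items()):
    ax.plot(years, counts)
    ax.set_title(lang, loc="left")
    ax.set_ylabel("mentions")
axes[-1].set_xlabel("year")
fig.tight_layout()
plt.show()
```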
User Metrics, Voting, and Privacy
- Multiple commenters want personal analytics: upvote/downvote ratios, most-upvoted and most-flagged comments, active times, and recurring patterns in whom they upvote or mark as a foe.
- API limitations:
  - Only submission scores are exposed; individual vote interactions are not (see the user-endpoint sketch after this list).
  - Some have resorted to scraping HTML or browser automation to recover their own vote data.
- Philosophical split on voting:
  - One side sees arrows as largely meaningless or even harmful (feedback to trolls, ambiguous intent).
  - Others view them as useful community sentiment/visibility signals and are curious about their own downvote habits.
- GDPR and “right to be forgotten” come up:
  - Concern that third‑party public datasets may retain content after HN itself deletes it; how that interacts with legal obligations is unclear.
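The public user endpoint makes the limitation visible: the profile object carries fields like `id`, `created`, `karma`, and the `submitted` item ids, but no record of votes cast. A minimal sketch (the username is just an example):

```python
import requests


def fetch_user(username: str) -> dict:
    """Fetch a public HN user profile via the official Firebase API."""
    url = f"https://hacker-news.firebaseio.com/v0/user/{username}.json"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()


profile = fetch_user("pg")  # any public username works
# Nothing here says what the user voted on, which is why commenters fall
# back to scraping their own logged-in pages to recover that history.
print(profile["karma"])
print(len(profile.get("submitted", [])))
```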
Data Volume, Storage, and Tooling
- The full HN JSONL dump is ~20 GB, a figure some find surprisingly small and others find large once metadata overhead is considered.
- Comparisons with SQLite show similar sizes; JSON overhead is offset by SQLite’s indexing and internal structures.
- People discuss:
  - Compression: zstd-compressed Parquet files often come out 2–3× smaller than DuckDB database files (a conversion sketch follows this list).
  - The idea that future APIs may directly return DuckDB or similar database files instead of raw JSON.
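A sketch of the JSONL → Parquet step in DuckDB's Python client; the file names are illustrative, while `read_json_auto` and zstd Parquet output are standard DuckDB features:

```python
import duckdb

con = duckdb.connect()  # in-memory session is enough for a one-off convert

# Load the raw dump; DuckDB infers the schema from the JSON keys.
con.execute("CREATE TABLE items AS SELECT * FROM read_json_auto('hn.jsonl')")

# Write zstd-compressed Parquet, the format commenters report coming out
# 2-3x smaller than an equivalent DuckDB database file.
con.execute("COPY items TO 'hn.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")

# The Parquet file stays directly queryable, with no import step:
print(con.execute("""
    SELECT type, count(*) AS n
    FROM 'hn.parquet'
    GROUP BY type
    ORDER BY n DESC
""").fetchall())
```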
LLMs, Bots, and the Future of Discussion & Trust
- The article’s joke about training many HN‑style bots triggers serious concern:
  - Some think large‑scale LLM commenting is inevitable or already happening, especially for short comments.
  - Others argue HN’s moderation, rate limiting, and culture form a strong “immune system” that would make widespread bot takeover expensive and unrewarding.
- Broader worries include:
  - Loss of human authenticity and meaning if feeds fill with plausible but synthetic encouragement and advice.
  - Difficulty distinguishing humans from bots as models improve.
- Proposed countermeasures:
  - Identity verification (IDs, phone numbers, biometric checks), invite‑only communities, and increasingly aggressive CAPTCHA/turnstile systems.
  - Cryptographic or “web of trust” schemes to prove “unique human” status without global tracking, though these face usability, governance, and abuse challenges (a toy attestation sketch closes this summary).
- Some see eventual state‑backed digital identity as likely; others prefer bottom‑up trust networks and note historical analogs (PGP webs of trust, TLS CA hierarchies).
- Many are pessimistic about any perfect technical fix; they expect an ongoing arms race and emphasize critical thinking and better filters as essential.
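For the “web of trust” idea specifically, here is a toy sketch of a single link in such a chain, using Ed25519 signatures from the `cryptography` package; everything here is illustrative, and the hard parts the thread flags (revocation, sybil resistance, key loss, governance) are exactly what it leaves out:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

# Toy model: an established member vouches for a newcomer by signing the
# newcomer's public key. Anyone who already trusts the voucher can verify
# the attestation offline, with no global identity registry involved.
voucher_key = Ed25519PrivateKey.generate()   # established member
newcomer_key = Ed25519PrivateKey.generate()  # person being vouched for

newcomer_pub = newcomer_key.public_key().public_bytes(
    Encoding.Raw, PublicFormat.Raw
)
attestation = voucher_key.sign(newcomer_pub)  # "I vouch this key is a human"

# Verification raises InvalidSignature if the attestation was forged.
try:
    voucher_key.public_key().verify(attestation, newcomer_pub)
    print("attestation verifies")
except InvalidSignature:
    print("forged attestation")
```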