Show HN: 22 GB of Hacker News in SQLite

Dataset size and format

  • The archive covers ~46M items from 2006–2025, stored as 1,600+ SQLite “shards”: ~22 GB uncompressed, ~8.5–9 GB gzipped.
  • Some commenters wonder why mostly-text data ends up this large (a rough back-of-envelope follows this list) and how well it runs on lower-end devices (e.g., tablets).
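
A rough back-of-envelope using only the figures quoted above (~46M items, ~22 GB uncompressed) puts the average item at a few hundred bytes, which is plausible once comment text, metadata, and SQLite index overhead are counted:

    # Average bytes per item, from the quoted totals.
    items = 46_000_000            # ~46M stories and comments
    uncompressed = 22 * 1024**3   # ~22 GB of SQLite files
    print(f"{uncompressed / items:.0f} bytes per item")
    # -> ~514 bytes: text plus ids, timestamps, authors, parent pointers,
    #    and B-tree/index overhead.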

SQLite vs DuckDB and other stores

  • The author chose SQLite for its simplicity and “single-file DB” model.
  • Multiple commenters suggest DuckDB as a more analytics-oriented, columnar alternative with better compression and HTTP-range querying over Parquet (sketched after this list).
  • Others argue SQLite’s ubiquity and maturity make it a reasonable default, and note that the schema (lots of large TEXT fields) may limit the gains from columnar compression.
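
A minimal sketch of the DuckDB route commenters describe, assuming a hypothetical Parquet export at a placeholder URL and an assumed schema: the httpfs extension issues HTTP range requests, so only the Parquet footer plus the row groups and columns the query touches are fetched.

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")   # enables HTTP range reads

    # Placeholder URL and column names; not the project's actual dataset.
    url = "https://example.com/hn/items-2024.parquet"
    top = con.execute(
        "SELECT author, count(*) AS n "
        f"FROM read_parquet('{url}') "
        "WHERE type = 'comment' GROUP BY author ORDER BY n DESC LIMIT 10"
    ).fetchall()
    print(top)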

Client-side architecture: shards, VFS, range requests

  • Core trick: SQLite compiled to WebAssembly runs in the browser and loads only the shards needed for the viewed day or query.
  • Sharding is used instead of serving a single 22 GB file via HTTP Range requests, partly to suit “dumb” static hosting (see the sketch after this list).
  • Comparisons are drawn to sql.js-httpvfs, sqlite-s3vfs, PMTiles, Parquet/Iceberg, and other systems built on HTTP range requests and on-demand fetching.
  • Some propose experimenting with alternative backends: DuckDB + Parquet, SQLite + range requests, or even torrent/WebTorrent distribution.
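
A sketch of the shard-per-day idea in plain Python rather than in-browser WASM (the site itself uses SQLite compiled to WebAssembly): map a date to one shard file, fetch only that file from static hosting, and query it locally. The URL pattern, shard naming, and schema are assumptions for illustration.

    import sqlite3
    import tempfile
    import urllib.request
    from datetime import date

    BASE = "https://example.com/hn-shards"   # hypothetical static host

    def fetch_shard(day: date) -> str:
        """Download the single shard assumed to cover `day`."""
        tmp = tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False)
        urllib.request.urlretrieve(f"{BASE}/{day.isoformat()}.sqlite", tmp.name)
        return tmp.name

    def top_stories(day: date, limit: int = 10):
        db = sqlite3.connect(fetch_shard(day))
        try:
            # Assumed schema: items(id, type, title, score, time, ...)
            return db.execute(
                "SELECT id, title, score FROM items "
                "WHERE type = 'story' ORDER BY score DESC LIMIT ?",
                (limit,),
            ).fetchall()
        finally:
            db.close()

    print(top_stories(date(2024, 1, 15)))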

Performance, compatibility, UX

  • Pagination and per-shard queries feel fast; queries spanning many shards (e.g., a SELECT * over the whole archive) can be slow.
  • Discussion of LIMIT semantics and how much data engines actually read before returning the first rows (see the EXPLAIN QUERY PLAN sketch after this list).
  • An initial Firefox WASM issue was reported, then confirmed working later.
  • UI feedback: the query view’s shard selection is complex, calendar navigation is janky, and the mobile design is called “brutal”; feature requests include a year-level view and fuller comment counts (currently only top-level comments are counted).
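
On the LIMIT question: how much an engine reads depends largely on whether the ORDER BY can be satisfied from an index, so a LIMIT 10 may stop after ten index entries or may still scan and sort everything first. SQLite’s EXPLAIN QUERY PLAN makes the difference visible; the toy table below is illustrative, not the archive’s schema.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE items (id INTEGER PRIMARY KEY, score INT, title TEXT);
        CREATE INDEX idx_items_score ON items(score);
    """)

    # Indexed ORDER BY: SQLite walks the index and can stop after 10 rows.
    for row in db.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM items ORDER BY score DESC LIMIT 10"
    ):
        print(row)   # ... SCAN items USING INDEX idx_items_score

    # Un-indexed ORDER BY: the whole table is scanned and sorted first.
    for row in db.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM items ORDER BY title LIMIT 10"
    ):
        print(row)   # ... SCAN items + USE TEMP B-TREE FOR ORDER BY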

Data acquisition, dumps, legality

  • Data is pulled from the public BigQuery HN dataset via a script that produces JSON.gz files, which are then ETL’d into the SQLite shards (a minimal sketch follows this list).
  • Several people want an official, easily downloadable HN dump instead of relying on BigQuery.
  • Concern is raised about whether this kind of archive conflicts with HN’s terms of use around commercial exploitation; others note that this project appears non-commercial.
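
A minimal ETL sketch under stated assumptions: the input is newline-delimited JSON.gz with one HN item per line, items are routed to a per-day shard by their Unix timestamp, and field names follow the HN API (the JSON "by" field is stored in an author column). This is not the project’s actual pipeline.

    import gzip
    import json
    import os
    import sqlite3
    from datetime import datetime, timezone

    _shards: dict[str, sqlite3.Connection] = {}

    def shard_for(ts: int) -> sqlite3.Connection:
        """Open (and lazily create) the shard for the item's UTC day."""
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        if day not in _shards:
            db = sqlite3.connect(f"shards/{day}.sqlite")
            db.execute(
                "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, "
                "type TEXT, author TEXT, time INT, title TEXT, text TEXT, "
                "parent INT, score INT)"
            )
            _shards[day] = db
        return _shards[day]

    def load(path: str) -> None:
        os.makedirs("shards", exist_ok=True)
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                shard_for(item["time"]).execute(
                    "INSERT OR REPLACE INTO items VALUES (?,?,?,?,?,?,?,?)",
                    (item["id"], item.get("type"), item.get("by"),
                     item["time"], item.get("title"), item.get("text"),
                     item.get("parent"), item.get("score")),
                )
        for db in _shards.values():
            db.commit()
            db.close()

    load("hn-items.json.gz")    # placeholder file name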

Offline use, Kiwix, and “baked data”

  • Strong interest in fully offline use: a script was added to download all static assets, and there are suggestions to package the archive as a .zim file for Kiwix (see the mirroring sketch after this list).
  • Ties to the broader “baked data” or “single-table/database application” pattern: static, read-only data shipped as files, queried locally.
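
A hedged sketch of the offline-download idea: given a manifest listing shard file names (the manifest name and URL layout are assumptions), mirror everything into a local directory so the archive keeps working without a network.

    import pathlib
    import urllib.request

    BASE = "https://example.com/hn-shards"   # hypothetical static host
    MANIFEST = "manifest.txt"                # assumed: one file name per line

    def mirror(dest: str = "offline") -> None:
        out = pathlib.Path(dest)
        out.mkdir(exist_ok=True)
        listing = urllib.request.urlopen(f"{BASE}/{MANIFEST}").read().decode()
        for name in listing.split():
            target = out / name
            if target.exists():              # already fetched; skip
                continue
            urllib.request.urlretrieve(f"{BASE}/{name}", str(target))
            print("fetched", name)

    mirror()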

Analysis, related projects, and extensions

  • People share similar SQLite-at-scale experiences (e.g., multi-terabyte Reddit imports) and tips such as VACUUM and auto_vacuum (see the sketch after this list).
  • Others build and share visualizations and statistics (time-based heatmaps, score distributions) and ask to integrate them into the archive.
  • Ideas floated: pairing a small LLM with the dataset, cross-platform native app, better RAG-style search over “high-quality public commons”.
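
On the VACUUM/auto_vacuum tips: a SQLite file does not shrink on its own after large deletes; VACUUM rewrites it compactly, and auto_vacuum=INCREMENTAL lets freed pages be returned later without a full rewrite. A small illustration against a hypothetical local shard:

    import os
    import sqlite3

    PATH = "big-shard.sqlite"         # hypothetical existing database file

    before = os.path.getsize(PATH)
    db = sqlite3.connect(PATH)
    # On an existing DB, the auto_vacuum mode only takes effect after a VACUUM.
    db.execute("PRAGMA auto_vacuum = INCREMENTAL")
    db.execute("VACUUM")              # rewrites the file, dropping free pages
    db.close()
    after = os.path.getsize(PATH)
    print(f"{before:,} -> {after:,} bytes")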

Meta discussion: writing style and text vs video

  • Some accuse the project description of sounding LLM-written; others push back, arguing that creative phrasing and em dashes are not evidence of AI.
  • Side discussion on how text is orders of magnitude more storage-efficient than video (a rough comparison follows), plus workflows for converting video to text for offline reading.
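
A rough, assumption-laden comparison behind the “orders of magnitude” claim: the transcript of an hour-long talk versus its video file.

    # Ballpark assumptions: ~150 spoken words/min, ~6 bytes per word of text,
    # ~5 Mbit/s for 1080p video.
    minutes = 60
    transcript_bytes = minutes * 150 * 6              # ~54 KB of text
    video_bytes = minutes * 60 * 5_000_000 // 8       # ~2.25 GB of video
    print(f"video is ~{video_bytes / transcript_bytes:,.0f}x larger")
    # -> roughly 40,000x, i.e. four to five orders of magnitude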