Show HN: 22 GB of Hacker News in SQLite
Dataset size and format
- Archive covers ~46M items from 2006–2025, stored as 1,600+ SQLite “shards”, ~22 GB uncompressed and ~8.5–9 GB gzipped.
- Some wonder why it’s so large for mostly text and how well it runs on lower-end devices (e.g., tablets).
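Back-of-the-envelope arithmetic with the rounded figures above (the exact counts and sizes are the author's) shows why the per-shard footprint stays small enough for lower-end devices:

```python
# Rough per-shard footprint implied by the figures quoted above.
items = 46_000_000           # ~46M items
shards = 1_600               # 1,600+ shard files
total_bytes = 22 * 1024**3   # ~22 GB uncompressed

print(f"~{items / shards:,.0f} items per shard")              # ~28,750
print(f"~{total_bytes / shards / 1024**2:.1f} MB per shard")  # ~14 MB
```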
SQLite vs DuckDB and other stores
- Author chose SQLite directly for its simplicity and “single-file DB” model.
- Multiple commenters suggest DuckDB as a more analytics-oriented, columnar alternative with better compression and HTTP-range querying over Parquet.
- Others argue SQLite’s ubiquity and maturity make it a reasonable default, and note the schema (lots of large TEXT fields) may limit gains from columnar compression.
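For illustration, a minimal sketch of the DuckDB-over-Parquet approach commenters suggest, assuming the archive were re-exported to a hypothetical Parquet file with Unix-epoch `time` and integer `score` columns; with the httpfs extension, DuckDB fetches only the byte ranges for the columns and row groups a query touches:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension for HTTP(S) reads via Range requests
con.execute("LOAD httpfs")

# Hypothetical Parquet export of the archive; only the columns and row groups
# this query needs would actually be downloaded.
rows = con.execute("""
    SELECT date_trunc('year', to_timestamp("time")) AS year, avg(score) AS avg_score
    FROM read_parquet('https://example.com/hn/items.parquet')
    GROUP BY year
    ORDER BY year
""").fetchall()
print(rows)
```

Whether this beats SQLite here depends on the schema: the large TEXT bodies noted above benefit less from columnar compression than numeric columns would.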
Client-side architecture: shards, VFS, range requests
- Core trick: SQLite compiled to WebAssembly in the browser, loading only the shards needed for the viewed day or query.
- Sharding is used instead of a single 22GB file with HTTP Range requests, partly to suit “dumb” static hosting.
- Comparisons drawn to sql.js-httpvfs, sqlite-s3vfs, PMTiles, Parquet/Iceberg, and other systems that use HTTP range requests and on-demand fetching.
- Some propose experimenting with “hackends”: DuckDB + Parquet, SQLite + range requests, or even torrent/webtorrent distributions.
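To make the range-request comparison concrete, here is a minimal sketch (not the project's code) of the trick systems like sql.js-httpvfs rely on: treat a remote SQLite file as a block device and fetch only the pages a query touches via HTTP Range headers. The URL and page size below are assumptions:

```python
import urllib.request

DB_URL = "https://example.com/hn/2015-06.sqlite"  # hypothetical remote SQLite file
PAGE_SIZE = 4096                                  # SQLite's default page size

def read_page(page_number: int) -> bytes:
    """Fetch a single SQLite page from a static file server via a Range request."""
    start = page_number * PAGE_SIZE
    req = urllib.request.Request(
        DB_URL, headers={"Range": f"bytes={start}-{start + PAGE_SIZE - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # server answers 206 Partial Content with just this page

# Page 0 holds the database header ("SQLite format 3\x00") and the schema root.
print(read_page(0)[:16])
```

Per-file sharding avoids relying on Range support at all: a “dumb” static host only has to serve whole, small files.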
Performance, compatibility, UX
- Pagination and per-shard queries feel fast; querying across many shards (e.g., `SELECT *` over all of them) can be slow (see the sketch after this list).
- Discussion about `LIMIT` semantics and how much data engines actually read.
- Initial Firefox WASM issue reported, then confirmed working later.
- Feedback on UI: the query view’s shard selection is complex; calendar navigation is janky; mobile design called “brutal”; feature requests for a year-level view and better comment counts (currently only top-level comments are counted).
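A minimal sketch of why cross-shard queries cost more, assuming hypothetical shard filenames and an `items` table (the real schema may differ):

```python
import sqlite3

# Hypothetical shard filenames and `items` schema; the real archive may differ.
shards = ["2024-01.sqlite", "2024-02.sqlite", "2024-03.sqlite"]

con = sqlite3.connect(shards[0])
for i, path in enumerate(shards[1:], start=1):
    con.execute(f"ATTACH DATABASE ? AS shard{i}", (path,))

# The LIMIT cannot be satisfied until every branch of the UNION ALL has been
# read far enough to know which rows sort highest, so the work grows with the
# number of shards even though only 10 rows come back.
query = """
    SELECT id, title, score FROM items
    UNION ALL SELECT id, title, score FROM shard1.items
    UNION ALL SELECT id, title, score FROM shard2.items
    ORDER BY score DESC
    LIMIT 10
"""
for row in con.execute(query):
    print(row)
```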
Data acquisition, dumps, legality
- Data pulled from the public BigQuery HN dataset via a script that generates JSON.gz, then ETL’d into shards (see the sketch after this list).
- Several people want an official, easily downloadable HN dump instead of relying on BigQuery.
- Concern raised about whether this kind of archive conflicts with HN’s terms of use for commercial exploitation; others note this project appears non-commercial.
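A minimal sketch of the kind of ETL step described above, assuming newline-delimited JSON.gz with BigQuery-style HN fields and a hypothetical one-shard-per-month layout; the real script’s schema and sharding scheme may differ:

```python
import gzip, json, sqlite3
from datetime import datetime, timezone

SCHEMA = """CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY, type TEXT, time INTEGER, "by" TEXT,
    title TEXT, text TEXT, score INTEGER, parent INTEGER)"""
FIELDS = ("id", "type", "time", "by", "title", "text", "score", "parent")

def shard_path(unix_time: int) -> str:
    # Hypothetical one-shard-per-month naming; the real archive may shard differently.
    return f"hn-{datetime.fromtimestamp(unix_time, tz=timezone.utc):%Y-%m}.sqlite"

def load(dump_path: str) -> None:
    cons: dict[str, sqlite3.Connection] = {}
    with gzip.open(dump_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            item = json.loads(line)
            path = shard_path(item.get("time") or 0)
            con = cons.get(path)
            if con is None:
                con = cons[path] = sqlite3.connect(path)
                con.execute(SCHEMA)
            con.execute(
                "INSERT OR REPLACE INTO items VALUES (:id,:type,:time,:by,:title,:text,:score,:parent)",
                {k: item.get(k) for k in FIELDS},
            )
    for con in cons.values():
        con.commit()
        con.close()
```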
Offline use, Kiwix, and “baked data”
- Strong interest in fully offline use: script added to download all static assets; suggestions to package as a .zim file for Kiwix.
- Ties to the broader “baked data” or “single-table/database application” pattern: static, read-only data shipped as files, queried locally.
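In that pattern, the client opens the shipped file strictly read-only; a minimal sketch with a hypothetical shard file and schema:

```python
import sqlite3

# "Baked data": ship a read-only database file with the app and query it locally.
# Hypothetical shard filename; mode=ro and immutable=1 guarantee it is never modified.
con = sqlite3.connect("file:hn-2015-06.sqlite?mode=ro&immutable=1", uri=True)
for title, score in con.execute(
    "SELECT title, score FROM items WHERE type = 'story' ORDER BY score DESC LIMIT 5"
):
    print(score, title)
```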
Analysis, related projects, and extensions
- People share similar SQLite-at-scale experiences (e.g., multi-terabyte Reddit imports) and tips (VACUUM, auto_vacuum).
- Others build and share visualizations and statistics (time-based heatmaps, score distributions) and ask to integrate them into the archive.
- Ideas floated: pairing a small LLM with the dataset, cross-platform native app, better RAG-style search over “high-quality public commons”.
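As a sketch of the kind of statistics being shared, a time-based heatmap query against a single shard (hypothetical filename and `items` schema with a Unix-epoch `time` column):

```python
import sqlite3
from collections import Counter

# Count submissions per (weekday, hour) for a time-based heatmap.
con = sqlite3.connect("hn-2015-06.sqlite")
heat = Counter(
    (weekday, hour)
    for weekday, hour in con.execute(
        "SELECT strftime('%w', time, 'unixepoch'), strftime('%H', time, 'unixepoch') "
        "FROM items WHERE type = 'story'"
    )
)
for (weekday, hour), n in sorted(heat.items()):
    print(weekday, hour, n)
```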
Meta discussion: writing style and text vs video
- Some accuse the project description of sounding LLM-written; others push back, arguing that creative phrasing and em dashes are not evidence of AI.
- Side discussion on how text is orders of magnitude more storage-efficient than video, and workflows for converting video to text for offline reading.