Show HN: 22 GB of Hacker News in SQLite
Dataset size and format
- Archive covers ~46M items from 2006–2025, stored as 1,600+ SQLite “shards”, ~22 GB uncompressed and ~8.5–9 GB gzipped.
- Some wonder why it’s so large for mostly text and how well it runs on lower-end devices (e.g., tablets).
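Back-of-the-envelope arithmetic with the rounded figures above (the exact counts and sizes are the author's) shows why the per-shard footprint stays small enough for lower-end devices:

```python
# Rough per-shard footprint implied by the figures quoted above.
items = 46_000_000           # ~46M items
shards = 1_600               # 1,600+ shard files
total_bytes = 22 * 1024**3   # ~22 GB uncompressed

print(f"~{items / shards:,.0f} items per shard")              # ~28,750
print(f"~{total_bytes / shards / 1024**2:.1f} MB per shard")  # ~14 MB
```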
SQLite vs DuckDB and other stores
- Author chose SQLite directly for its simplicity and “single-file DB” model.
- Multiple commenters suggest DuckDB as a more analytics-oriented, columnar alternative with better compression and HTTP-range querying over Parquet.
- Others argue SQLite’s ubiquity and maturity make it a reasonable default, and note the schema (lots of large TEXT fields) may limit gains from columnar compression.
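For illustration, a minimal sketch of the DuckDB-over-Parquet approach commenters suggest, assuming the archive were re-exported to a hypothetical Parquet file with Unix-epoch `time` and integer `score` columns; with the httpfs extension, DuckDB fetches only the byte ranges for the columns and row groups a query touches:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension for HTTP(S) reads via Range requests
con.execute("LOAD httpfs")

# Hypothetical Parquet export of the archive; only the columns and row groups
# this query needs would actually be downloaded.
rows = con.execute("""
    SELECT date_trunc('year', to_timestamp("time")) AS year, avg(score) AS avg_score
    FROM read_parquet('https://example.com/hn/items.parquet')
    GROUP BY year
    ORDER BY year
""").fetchall()
print(rows)
```

Whether this beats SQLite here depends on the schema: the large TEXT bodies noted above benefit less from columnar compression than numeric columns would.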
Client-side architecture: shards, VFS, range requests
- Core trick: SQLite compiled to WebAssembly in the browser, loading only the shards needed for the viewed day or query.
- Sharding is used instead of a single 22GB file with HTTP Range requests, partly to suit “dumb” static hosting.
- Comparisons drawn to sql.js-httpvfs, sqlite-s3vfs, PMTiles, Parquet/Iceberg, and other systems that use HTTP range requests and on-demand fetching.
- Some propose experimenting with “hackends”: DuckDB + Parquet, SQLite + range requests, or even torrent/webtorrent distributions.
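To make the range-request comparison concrete, here is a minimal sketch (not the project's code) of the trick systems like sql.js-httpvfs rely on: treat a remote SQLite file as a block device and fetch only the pages a query touches via HTTP Range headers. The URL and page size below are assumptions:

```python
import urllib.request

DB_URL = "https://example.com/hn/2015-06.sqlite"  # hypothetical remote SQLite file
PAGE_SIZE = 4096                                  # SQLite's default page size

def read_page(page_number: int) -> bytes:
    """Fetch a single SQLite page from a static file server via a Range request."""
    start = page_number * PAGE_SIZE
    req = urllib.request.Request(
        DB_URL, headers={"Range": f"bytes={start}-{start + PAGE_SIZE - 1}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # server answers 206 Partial Content with just this page

# Page 0 holds the database header ("SQLite format 3\x00") and the schema root.
print(read_page(0)[:16])
```

Per-file sharding avoids relying on Range support at all: a “dumb” static host only has to serve whole, small files.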
Performance, compatibility, UX
- Pagination and per-shard queries feel fast; querying across many shards (e.g., `SELECT *` over all of them) can be slow (see the sketch after this list).
- Discussion about `LIMIT` semantics and how much data engines actually read.
- Initial Firefox WASM issue reported, then confirmed working later.
- Feedback on UI: the query view’s shard selection is complex; calendar navigation is janky; mobile design called “brutal”; feature requests for a year-level view and better comment counts (currently only top-level comments are counted).
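A minimal sketch of why cross-shard queries cost more, assuming hypothetical shard filenames and an `items` table (the real schema may differ):

```python
import sqlite3

# Hypothetical shard filenames and `items` schema; the real archive may differ.
shards = ["2024-01.sqlite", "2024-02.sqlite", "2024-03.sqlite"]

con = sqlite3.connect(shards[0])
for i, path in enumerate(shards[1:], start=1):
    con.execute(f"ATTACH DATABASE ? AS shard{i}", (path,))

# The LIMIT cannot be satisfied until every branch of the UNION ALL has been
# read far enough to know which rows sort highest, so the work grows with the
# number of shards even though only 10 rows come back.
query = """
    SELECT id, title, score FROM items
    UNION ALL SELECT id, title, score FROM shard1.items
    UNION ALL SELECT id, title, score FROM shard2.items
    ORDER BY score DESC
    LIMIT 10
"""
for row in con.execute(query):
    print(row)
```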
Data acquisition, dumps, legality
- Data pulled from the public BigQuery HN dataset via a script that generates JSON.gz, then ETL’d into shards (see the sketch after this list).
- Several people want an official, easily downloadable HN dump instead of relying on BigQuery.
- Concern raised about whether this kind of archive conflicts with HN’s terms of use for commercial exploitation; others note this project appears non-commercial.
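A minimal sketch of the kind of ETL step described above, assuming newline-delimited JSON.gz with BigQuery-style HN fields and a hypothetical one-shard-per-month layout; the real script’s schema and sharding scheme may differ:

```python
import gzip, json, sqlite3
from datetime import datetime, timezone

SCHEMA = """CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY, type TEXT, time INTEGER, "by" TEXT,
    title TEXT, text TEXT, score INTEGER, parent INTEGER)"""
FIELDS = ("id", "type", "time", "by", "title", "text", "score", "parent")

def shard_path(unix_time: int) -> str:
    # Hypothetical one-shard-per-month naming; the real archive may shard differently.
    return f"hn-{datetime.fromtimestamp(unix_time, tz=timezone.utc):%Y-%m}.sqlite"

def load(dump_path: str) -> None:
    cons: dict[str, sqlite3.Connection] = {}
    with gzip.open(dump_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            item = json.loads(line)
            path = shard_path(item.get("time") or 0)
            con = cons.get(path)
            if con is None:
                con = cons[path] = sqlite3.connect(path)
                con.execute(SCHEMA)
            con.execute(
                "INSERT OR REPLACE INTO items VALUES (:id,:type,:time,:by,:title,:text,:score,:parent)",
                {k: item.get(k) for k in FIELDS},
            )
    for con in cons.values():
        con.commit()
        con.close()
```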
Offline use, Kiwix, and “baked data”
- Strong interest in fully offline use: script added to download all static assets; suggestions to package as a .zim file for Kiwix.
- Ties to the broader “baked data” or “single-table/database application” pattern: static, read-only data shipped as files, queried locally.
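In that pattern, the client opens the shipped file strictly read-only; a minimal sketch with a hypothetical shard file and schema:

```python
import sqlite3

# "Baked data": ship a read-only database file with the app and query it locally.
# Hypothetical shard filename; mode=ro and immutable=1 guarantee it is never modified.
con = sqlite3.connect("file:hn-2015-06.sqlite?mode=ro&immutable=1", uri=True)
for title, score in con.execute(
    "SELECT title, score FROM items WHERE type = 'story' ORDER BY score DESC LIMIT 5"
):
    print(score, title)
```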
Analysis, related projects, and extensions
- People share similar SQLite-at-scale experiences (e.g., multi-terabyte Reddit imports) and tips (VACUUM, auto_vacuum).
- Others build and share visualizations and statistics (time-based heatmaps, score distributions) and ask to integrate them into the archive.
- Ideas floated: pairing a small LLM with the dataset, cross-platform native app, better RAG-style search over “high-quality public commons”.
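As a sketch of the kind of statistics being shared, a time-based heatmap query against a single shard (hypothetical filename and `items` schema with a Unix-epoch `time` column):

```python
import sqlite3
from collections import Counter

# Count submissions per (weekday, hour) for a time-based heatmap.
con = sqlite3.connect("hn-2015-06.sqlite")
heat = Counter(
    (weekday, hour)
    for weekday, hour in con.execute(
        "SELECT strftime('%w', time, 'unixepoch'), strftime('%H', time, 'unixepoch') "
        "FROM items WHERE type = 'story'"
    )
)
for (weekday, hour), n in sorted(heat.items()):
    print(weekday, hour, n)
```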
Meta discussion: writing style and text vs video
- Some accuse the project description of sounding LLM-written; others push back, arguing that creative phrasing and em dashes are not evidence of AI.
- Side discussion on how text is orders of magnitude more storage-efficient than video, and workflows for converting video to text for offline reading.