How bloom filters made SQLite 10x faster

SQLite project & contribution model

  • Some commenters push back on claims that SQLite is “not open contribution,” citing positive experiences reporting bugs, asking for features, and seeing patches merged.
  • Others note that SQLite intentionally restricts contributions, but stress that collaboration on major features (like Bloom filters) clearly still happens.
  • Consensus: “Open contribution” is not automatically better; SQLite’s model works reasonably well in practice.

SQLite in production & concurrency

  • Many report using SQLite successfully in production web apps, especially for read-heavy or modest-write workloads.
  • The single-writer limitation is real but often overstated; with fast transactions, one writer can handle large user bases.
  • WAL mode and busy_timeout are highlighted as critical configuration details; lack of awareness causes unnecessary lock errors.
  • There is disagreement on how viable multi-process access is: some see global file locks as quickly becoming a bottleneck; others say it’s fine with tuning and realistic workloads.

Bloom filters in SQLite

  • Bloom filters are used to pre-filter join probes in nested loop joins, sometimes giving large speedups by avoiding many B-tree lookups.
  • A bug (bits vs bytes) meant benchmarks initially used only 1/8 of the intended filter size, greatly increasing false positives. With that bug, worst-case performance regressions are small; best-case still equals pre-optimization behavior.
  • Questions are raised about how representative the “10x faster” claim is, since tests used simple integer primary keys.

Alternative filters & theory

  • Discussion covers binary fuse and cuckoo filters, trade-offs between mutability, memory, CPU cache behavior, and space.
  • Some argue that more advanced filters might not matter compared to disk I/O; benefits are biggest when quickly rejecting non-matching rows.
  • Links and explanations touch on static-to-dynamic transformations (Bentley–Saxe), generational approaches, and b-bit hash existence filters.

Deletes, maintenance, and false positives

  • Concern: frequent deletes increase false positive rate over time.
  • Mitigations discussed: rebuilding per query, rotating filters, or using cuckoo filters that support deletion.

SQLite vs client–server databases & deployment

  • SQLite is praised as a simple, cheap default (often co-located with the app) removing the need to run a separate DB service.
  • Others argue that once you truly need multiple machines, background workers, or replication/failover, dedicated systems like Postgres are usually a better fit.
  • For distributed read scaling, tools like LiteFS and SQLite’s rsync-based replication are mentioned; Kubernetes use typically requires an extra access layer.

Joins, planners, and NP-hardness

  • Surprise that SQLite uses only nested loop joins; clarification that with indexes it’s still efficient, and Bloom filters just replace many index probes.
  • Some confusion over join order is resolved: the order affects how many rows survive earlier filters, changing total work even with the same asymptotic complexity.
  • More broadly, query planning is discussed as an NP-hard search problem where heuristics, beam search, and genetic algorithms are used to prune plan space.