How bloom filters made SQLite 10x faster
SQLite project & contribution model
- Some commenters push back on claims that SQLite is “not open contribution,” citing positive experiences reporting bugs, asking for features, and seeing patches merged.
- Others note that SQLite intentionally restricts contributions, but stress that collaboration on major features (like Bloom filters) clearly still happens.
- Consensus: “Open contribution” is not automatically better; SQLite’s model works reasonably well in practice.
SQLite in production & concurrency
- Many report using SQLite successfully in production web apps, especially for read-heavy or modest-write workloads.
- The single-writer limitation is real but often overstated; with fast transactions, one writer can handle large user bases.
- WAL mode and
busy_timeoutare highlighted as critical configuration details; lack of awareness causes unnecessary lock errors. - There is disagreement on how viable multi-process access is: some see global file locks as quickly becoming a bottleneck; others say it’s fine with tuning and realistic workloads.
Bloom filters in SQLite
- Bloom filters are used to pre-filter join probes in nested loop joins, sometimes giving large speedups by avoiding many B-tree lookups.
- A bug (bits vs bytes) meant benchmarks initially used only 1/8 of the intended filter size, greatly increasing false positives. With that bug, worst-case performance regressions are small; best-case still equals pre-optimization behavior.
- Questions are raised about how representative the “10x faster” claim is, since tests used simple integer primary keys.
Alternative filters & theory
- Discussion covers binary fuse and cuckoo filters, trade-offs between mutability, memory, CPU cache behavior, and space.
- Some argue that more advanced filters might not matter compared to disk I/O; benefits are biggest when quickly rejecting non-matching rows.
- Links and explanations touch on static-to-dynamic transformations (Bentley–Saxe), generational approaches, and b-bit hash existence filters.
Deletes, maintenance, and false positives
- Concern: frequent deletes increase false positive rate over time.
- Mitigations discussed: rebuilding per query, rotating filters, or using cuckoo filters that support deletion.
SQLite vs client–server databases & deployment
- SQLite is praised as a simple, cheap default (often co-located with the app) removing the need to run a separate DB service.
- Others argue that once you truly need multiple machines, background workers, or replication/failover, dedicated systems like Postgres are usually a better fit.
- For distributed read scaling, tools like LiteFS and SQLite’s
rsync-based replication are mentioned; Kubernetes use typically requires an extra access layer.
Joins, planners, and NP-hardness
- Surprise that SQLite uses only nested loop joins; clarification that with indexes it’s still efficient, and Bloom filters just replace many index probes.
- Some confusion over join order is resolved: the order affects how many rows survive earlier filters, changing total work even with the same asymptotic complexity.
- More broadly, query planning is discussed as an NP-hard search problem where heuristics, beam search, and genetic algorithms are used to prune plan space.