2024-12-22

How bloom filters made SQLite 10x faster

SQLite project & contribution model

Some commenters push back on claims that SQLite is “not open contribution,” citing positive experiences reporting bugs, asking for features, and seeing patches merged.
Others note that SQLite intentionally restricts contributions, but stress that collaboration on major features (like Bloom filters) clearly still happens.
Consensus: “Open contribution” is not automatically better; SQLite’s model works reasonably well in practice.

SQLite in production & concurrency

Many report using SQLite successfully in production web apps, especially for read-heavy or modest-write workloads.
The single-writer limitation is real but often overstated; with fast transactions, one writer can handle large user bases.
WAL mode and busy_timeout are highlighted as critical configuration details; lack of awareness causes unnecessary lock errors.
There is disagreement on how viable multi-process access is: some see global file locks as quickly becoming a bottleneck; others say it’s fine with tuning and realistic workloads.

Bloom filters in SQLite

Bloom filters are used to pre-filter join probes in nested loop joins, sometimes giving large speedups by avoiding many B-tree lookups.
A bug (bits vs bytes) meant benchmarks initially used only 1/8 of the intended filter size, greatly increasing false positives. With that bug, worst-case performance regressions are small; best-case still equals pre-optimization behavior.
Questions are raised about how representative the “10x faster” claim is, since tests used simple integer primary keys.

Alternative filters & theory

Discussion covers binary fuse and cuckoo filters, trade-offs between mutability, memory, CPU cache behavior, and space.
Some argue that more advanced filters might not matter compared to disk I/O; benefits are biggest when quickly rejecting non-matching rows.
Links and explanations touch on static-to-dynamic transformations (Bentley–Saxe), generational approaches, and b-bit hash existence filters.

Deletes, maintenance, and false positives

Concern: frequent deletes increase false positive rate over time.
Mitigations discussed: rebuilding per query, rotating filters, or using cuckoo filters that support deletion.

SQLite vs client–server databases & deployment

SQLite is praised as a simple, cheap default (often co-located with the app) removing the need to run a separate DB service.
Others argue that once you truly need multiple machines, background workers, or replication/failover, dedicated systems like Postgres are usually a better fit.
For distributed read scaling, tools like LiteFS and SQLite’s rsync-based replication are mentioned; Kubernetes use typically requires an extra access layer.

Joins, planners, and NP-hardness

Surprise that SQLite uses only nested loop joins; clarification that with indexes it’s still efficient, and Bloom filters just replace many index probes.
Some confusion over join order is resolved: the order affects how many rows survive earlier filters, changing total work even with the same asymptotic complexity.
More broadly, query planning is discussed as an NP-hard search problem where heuristics, beam search, and genetic algorithms are used to prune plan space.

Related topics