2024-07-27

SQLite: 35% Faster Than the Filesystem

SQLite as a filesystem (“sqlitefs”)

Several people ask why no widespread “sqlitefs” exists.
Concerns: putting SQLite in the kernel is complex; FUSE adds overhead that could erase performance gains.
There are existing experiments: FUSE-based SQLite filesystems, macOS implementations, a GitHub project that mounts a SQLite DB as a FS, and Proxmox using SQLite + FUSE for VM configs.
Skepticism: turning a simple embeddable library into a full filesystem adds versioning, compatibility, and maintenance issues without clear benefit.

Why SQLite can outperform a filesystem for blobs

Core reason cited: many blobs in separate files require many open/close calls; a single SQLite file needs only one open/close, reducing syscall and access-control overhead.
Single large file improves OS caching and read-ahead; OS can better predict the working set.
Databases don’t need to support all POSIX semantics (multi-user, complex permissions, symlinks, xattrs), so they avoid that overhead.
Commenters note this is “faster than using the FS in a naive way,” not magically faster than the underlying storage.

OS and filesystem effects (especially Windows/NTFS)

Multiple anecdotes: NTFS and Windows tooling perform poorly with many small files (e.g., node_modules, big builds), sometimes 10× slower than Linux on the same hardware.
Others report little difference in specific workloads, suggesting workload and tool design matter.
A major culprit identified is synchronous antivirus / filesystem filter drivers (e.g., Defender) that hook every open/close, not NTFS alone.
Windows “Dev Drive” (ReFS + async AV) improves performance, but opinions differ on how much.
Asynchronous I/O (io_uring/IOCP) mainly improves concurrency, not per-op latency; unclear how it changes this specific benchmark.

SQLite for logs and application data

Some use SQLite (often WAL/WAL2) instead of text logs: writes are fast, reads and analysis become much more powerful (SQL, indices).
Pushback: harder to use simple tools like cat, grep, tail; workarounds involve piping sqlite3 output or writing small utilities.
For very large logs, row-store + many indexes may hurt write performance and space; columnar systems (ClickHouse, DuckDB) are suggested as better fits.

Blobs, hierarchy, and hybrid designs

Storing large blobs in SQLite works but has drawbacks: 2 GB limit, serialization complexity, and weaker tooling integration than plain files.
Incremental blob I/O APIs exist but are somewhat awkward (e.g., need zeroblob, cannot resize).
Many end up with hybrids: metadata and paths in SQLite, actual file contents on the filesystem, sometimes with virtual “folder” structures modeled in tables.

Raw devices and broader perspective

Some databases (Oracle, MySQL) can bypass filesystems and write directly to block devices, yielding modest gains at significant complexity.
Overall theme: filesystems are highly general; databases can win by optimizing for narrower access patterns and avoiding per-file and per-syscall overhead, but trade away generality and tooling.

Related topics