What Does a Database for SSDs Look Like?
Sequential vs random I/O on SSDs
- Several comments stress that SSDs still have a big performance gap between sequential and random I/O, just much smaller than on HDDs.
- Benchmarks show multi-GB/s sequential reads vs tens of MB/s for 4K random reads at queue depth 1; latency is microseconds instead of milliseconds, but locality still matters (a rough probe of this gap is sketched after this list).
- Controllers, filesystems and OS readahead are all tuned to reward large, aligned, predictable access patterns.
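A rough way to see the gap on a Linux box is to read the same file sequentially in large requests and randomly in 4K requests with the page cache bypassed. This is a sketch, not a benchmark: the path, request sizes, and single-threaded loop are assumptions, and a real tool such as fio with deep queue depths will report much higher sequential figures.

```python
import mmap
import os
import random
import time

PATH = "/tmp/ssd_test.bin"   # hypothetical test file, pre-filled and at least TOTAL bytes
SEQ_BLOCK = 1 << 20          # 1 MiB sequential requests
RND_BLOCK = 4096             # 4 KiB random requests
TOTAL = 256 << 20            # read 256 MiB in each pass

size = os.path.getsize(PATH)
fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)   # Linux-only: keep the page cache out of the measurement

def throughput_mb_s(block: int, offsets) -> float:
    buf = mmap.mmap(-1, block)                  # page-aligned buffer, as O_DIRECT requires
    start = time.perf_counter()
    for off in offsets:
        os.preadv(fd, [buf], off)
    return block * len(offsets) / (time.perf_counter() - start) / 1e6

seq_offsets = [i * SEQ_BLOCK for i in range(TOTAL // SEQ_BLOCK)]
rnd_offsets = [random.randrange(size // RND_BLOCK) * RND_BLOCK
               for _ in range(TOTAL // RND_BLOCK)]

print(f"sequential 1 MiB reads: {throughput_mb_s(SEQ_BLOCK, seq_offsets):8.1f} MB/s")
print(f"random     4 KiB reads: {throughput_mb_s(RND_BLOCK, rnd_offsets):8.1f} MB/s")
os.close(fd)
```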
SSD internals and their impact
- SSDs expose small logical blocks (512B/4K), but internally have:
  - programming pages (4–64K) and erase blocks (1–128 MiB);
  - a flash translation layer (FTL) that conceptually resembles an LSM with background compaction.
- Misaligned or tiny writes trigger read–modify–write cycles and extra NAND reads later; 128K-aligned writes can make random 128K reads as fast as sequential ones (see the alignment sketch after this list).
- Fragmentation on SSDs mostly hurts via request splitting, not seek costs.
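One practical consequence is that segment- or log-writing engines pad their writes out to a large aligned boundary. A minimal sketch, assuming a 128 KiB boundary (the real effective stripe size is device-specific and not exposed through the block interface); `aligned_pwrite` is a hypothetical helper and buffer alignment for O_DIRECT is ignored here:

```python
import os

# Assumed effective stripe size from the discussion above; the actual value
# is device-specific and not visible through the block interface.
ALIGN = 128 * 1024

def aligned_pwrite(fd: int, payload: bytes, offset: int) -> int:
    """Pad a payload out to a 128K boundary so the device never receives a
    partial stripe; tiny or misaligned writes would instead push the FTL into
    read-modify-write of a whole programming page."""
    assert offset % ALIGN == 0, "segment writes should start on an aligned boundary"
    padded_len = -(-len(payload) // ALIGN) * ALIGN   # round up to a multiple of ALIGN
    return os.pwrite(fd, payload.ljust(padded_len, b"\x00"), offset)
```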
WAL, batching, and durability
- One camp: WALs are still needed because host interfaces are block-based and the median DB transaction modifies only a few bytes; the WAL provides durability and turns random page updates into sequential log writes.
- Another camp: the WAL is primarily about durability; the batching gains really come from log-structured / LSM designs, with checkpointing and group commit as refinements (a group-commit sketch follows this list).
- Some note that you can unify the data store and the WAL via persistent (immutable) data structures, which also yields cheap snapshots.
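To make the group-commit refinement concrete, here is a minimal sketch, assuming a length-prefixed record format and a single background flusher thread; it illustrates the idea rather than any engine's actual design.

```python
import os
import threading

class GroupCommitWAL:
    """Minimal group-commit WAL sketch: writers append length-prefixed records
    under a lock, and a single background fsync covers every record appended
    since the last flush."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.cv = threading.Condition()
        self.appended = 0   # bytes handed to the OS
        self.durable = 0    # bytes known to have reached stable storage
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record: bytes) -> None:
        frame = len(record).to_bytes(4, "little") + record
        with self.cv:
            os.write(self.fd, frame)
            self.appended += len(frame)
            my_lsn = self.appended
            self.cv.notify_all()                 # wake the flusher
            while self.durable < my_lsn:         # block until an fsync covers this record
                self.cv.wait()

    def _flusher(self) -> None:
        while True:
            with self.cv:
                while self.appended == self.durable:
                    self.cv.wait()
                target = self.appended
            os.fsync(self.fd)                    # one flush amortized over the whole group
            with self.cv:
                self.durable = max(self.durable, target)
                self.cv.notify_all()             # release every covered committer
```

With many concurrent writers, the cost of each fsync is paid once per batch rather than once per transaction.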
Distributed logs vs single-node commit
- The article’s stance (“commit-to-disk on one node is unnecessary; durability is via replicated log”) is heavily debated.
- Critics warn about correlated failures, software bugs, and cluster-wide crashes; they argue fsyncing to local SSDs remains valuable and often faster than a network round-trip.
- Defenders point to designs like Aurora’s multi-AZ quorum model and argue that the failure probabilities can be made acceptably low, but others insist on empirical testing over paper guarantees (a quorum-commit sketch follows this list).
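The replicated-log position, reduced to a sketch: a commit returns once a quorum of replicas acknowledges the append, with no local fsync on the coordinating node. `replicated_commit` and the replica callables are hypothetical stand-ins for real RPCs.

```python
import concurrent.futures as cf

def replicated_commit(record: bytes, replicas, quorum: int, timeout: float = 0.05) -> bool:
    """Acknowledge a commit once a quorum of replicas reports the record appended.
    `replicas` is a list of callables (hypothetical network stubs) that append the
    record remotely and return on success or raise on failure."""
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica, record) for replica in replicas]
    acks = 0
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                acks += 1
                if acks >= quorum:
                    return True        # enough independent copies exist
    except cf.TimeoutError:
        pass                           # slow or failed replicas: no quorum in time
    finally:
        pool.shutdown(wait=False)      # don't block the commit path on stragglers
    return False
```

The critics' objection summarized above is that this only buys durability if replica failures are actually independent, which correlated bugs and cluster-wide outages can violate.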
Data structures and “DBs haven’t changed”
- Some claim DBs are stuck with B-trees/LSMs tuned for spinning disks, masking the inefficiency with faster hardware.
- Others counter that plenty of innovation exists (e.g., LeanStore/Umbra, hybrid compacting B-trees, LSM variants), but the block-device interface constrains designs.
- Debate continues over B-trees vs LSMs vs hybrids: tradeoffs in write amplification, multithreaded updates, compaction overhead, and cache behavior on SSDs (a toy LSM write path is sketched after this list).
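As a reference point for that debate, a toy sketch of the LSM write path only: random-key updates land in a memtable and reach the device as large, sorted, sequential runs, at the price of the compaction work discussed above. The names and the JSONL run format are made up for illustration; reads and compaction are omitted.

```python
import json

class TinyLSM:
    """Toy LSM write path: buffer updates in memory, then write one large
    sorted run sequentially, which is the access pattern SSDs reward."""

    def __init__(self, run_prefix: str = "run", memtable_limit: int = 4096):
        self.memtable = {}            # key -> latest value, newest wins
        self.limit = memtable_limit
        self.prefix = run_prefix
        self.runs = 0

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self) -> None:
        path = f"{self.prefix}-{self.runs:06d}.jsonl"
        with open(path, "w") as f:    # one big sequential write per sorted run
            for key in sorted(self.memtable):
                f.write(json.dumps({"k": key, "v": self.memtable[key]}) + "\n")
        self.runs += 1
        self.memtable.clear()         # reads would consult runs newest-first
```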
Wear and endurance
- Write endurance and write amplification remain major SSD concerns; LSM’s lower amplification is highlighted as a key advantage.
- Hyperscalers may care mainly about hitting 5-year lifetimes, while smaller deployments might accept more wear or simply replace instances in the cloud (a back-of-the-envelope lifetime estimate follows).
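A back-of-the-envelope way to reason about lifetime, with entirely illustrative numbers (drive capacity, DWPD rating, write rate, and write amplification are all assumptions):

```python
def ssd_lifetime_years(capacity_tb: float, rated_dwpd: float,
                       host_writes_tb_per_day: float, write_amp: float) -> float:
    """Rated endurance is capacity * DWPD over the warranty period; the workload
    consumes it at (host write rate) * (write amplification)."""
    warranty_days = 5 * 365                                   # typical 5-year rating
    rated_nand_writes_tb = capacity_tb * rated_dwpd * warranty_days
    nand_writes_tb_per_day = host_writes_tb_per_day * write_amp
    return rated_nand_writes_tb / nand_writes_tb_per_day / 365

# A hypothetical 3.84 TB, 1-DWPD drive taking 0.5 TB/day of host writes:
print(ssd_lifetime_years(3.84, 1.0, 0.5, 10.0))   # ~3.8 years at write amplification 10
print(ssd_lifetime_years(3.84, 1.0, 0.5, 5.0))    # ~7.7 years at write amplification 5
```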