Working with Files Is Hard (2019)
POSIX filesystem APIs and why they’re “hard”
- Research referenced in the thread shows many prominent systems (DBs, VCSs, distributed systems) misuse file APIs, even with expert developers.
- Many argue the core problem is the POSIX model: old, entrenched, and underspecified on key semantics (ordering, atomicity, error propagation).
- Others counter that APIs can’t be “impossible to misuse” and that many apps reasonably assume simpler conditions (e.g., single-writer).
- Some see this as a “Worse is Better” outcome: cheap-to-implement semantics outcompeted safer designs.
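Much of the misuse the thread describes centers on updating a file in place instead of the crash-safe temp-file-and-rename sequence. A minimal sketch of that sequence in Python (helper name is hypothetical; the directory fsync is the step the referenced research found most often omitted):

```python
import os
import tempfile

def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see either the old or the
    new contents, never a torn mix (hypothetical helper)."""
    dirname = os.path.dirname(os.path.abspath(path))
    # 1. Write the new contents to a temp file in the SAME directory,
    #    so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)          # 2. Flush file data before the rename.
    finally:
        os.close(fd)
    os.rename(tmp, path)      # 3. rename() atomically replaces the target.
    # 4. fsync the directory so the rename itself survives a crash.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Even this sketch assumes a local POSIX filesystem; as the thread notes, each step has edge cases (error propagation, network filesystems) that expert-written systems still get wrong.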
Alternative abstractions and atomicity models
- Several proposals: whole-file atomic writes via copy-on-write, atomic appends, treating files as atomic block maps, or transactional/DB-like semantics at the filesystem level.
- Advocates claim this would remove large bug classes and simplify reasoning about shared files.
- Critics raise concerns: multi-GB files, extra space for copy-on-write, SSD wear, multi-process access, and difficulty retrofitting existing software and filesystems.
- There’s discussion of database-style transactions (and deadlocks), with suggestions that MVCC-like approaches could mitigate some issues.
Barriers, fsync, and storage hardware behavior
- Debate over why Linux still lacks a non-flushing barrier syscall to separate ordering from durability; some think it would significantly help databases.
- Others note a prototype exists in research code but hasn’t been adopted, possibly due to limited benefit, SSD-era tradeoffs, or maintenance burden.
- NVMe, FUA, and controller caches complicate “flush” guarantees; buggy hardware and lack of proper FUA support are cited.
- It’s emphasized that some devices can lose or corrupt data even after flush, and that sector-atomic assumptions are not universally valid (e.g., certain non-volatile memories, commodity flash).
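Because fsync is the only portable tool for both ordering and durability, its return value carries a lot of weight. A sketch of the error-handling discipline the thread converges on (helper name hypothetical): on Linux a failed fsync may be reported only once and dirty pages can be marked clean afterward, so a caller must treat the failure as data loss rather than retry:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write and flush, treating any fsync failure as loss of the data
    (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # If this raises OSError, do NOT retry fsync on the same fd:
        # the kernel may have already dropped the error state and a
        # second call can "succeed" without the data being on disk.
        os.fsync(fd)
    finally:
        os.close(fd)
```

And as the hardware points above note, even a successful fsync is only as good as the device's flush and FUA behavior.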
Windows, C libraries, and API evolution
- Windows file APIs are described as somewhat safer/clearer but slower, with features like IOCP and mandatory locking on in-use executables.
- Lack of open research on Windows filesystems is attributed to NDAs and corporate control over publication.
- An analogy is drawn to unsafe C standard functions: attempts to “stage in” safer alternatives are messy, non-portable, and often misunderstood.
Databases, SQLite, and failure handling
- SQLite is praised as a safer choice when persisting local state, especially in specific modes (e.g., WAL, strict synchronous settings).
- Later research simulating fsync errors found that major systems (Redis, SQLite, LevelDB, LMDB, PostgreSQL) still mishandle some failure modes.
- Some systems deliberately rely on de facto hardware guarantees (sector-atomic writes), which may fail on certain devices.
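The SQLite modes the thread recommends can be enabled per connection. A minimal sketch using Python's built-in `sqlite3` module (this shows the two PRAGMAs named above; it is not a complete durability configuration):

```python
import sqlite3

def open_durable(path: str) -> sqlite3.Connection:
    """Open a SQLite database with the thread-recommended settings."""
    conn = sqlite3.connect(path)
    # WAL mode lets readers proceed while a writer commits and narrows
    # the window in which a crash can corrupt the main database file.
    conn.execute("PRAGMA journal_mode=WAL")
    # synchronous=FULL makes SQLite fsync at each transaction's critical
    # points instead of trusting the OS to flush eventually.
    conn.execute("PRAGMA synchronous=FULL")
    return conn
```

Note the later fsync-error research cited above found that even SQLite mishandles some simulated failure modes, so these settings raise the floor rather than guarantee safety.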
NFS and distributed semantics
- NFS is criticized for breaking important file guarantees (append, exclusive create, sync flags, locks, inotify), especially across UID mappings.
- This leads to surprising behaviors such as read access changing after a successful open, complicating userland code.
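Exclusive create is one of the guarantees at stake: on a local filesystem `O_CREAT|O_EXCL` is an atomic test-and-create, but older NFS implementations did not honor `O_EXCL`, so lockfile code built on it silently stops excluding. A sketch of the local-filesystem pattern (hypothetical helper name):

```python
import os

def try_lock(lockfile: str) -> bool:
    """Try to take an advisory lock by creating `lockfile` exclusively.
    Atomic on local filesystems; historically unreliable over NFS."""
    try:
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False          # someone else holds the lock
    os.write(fd, str(os.getpid()).encode())  # record the owner's PID
    os.close(fd)
    return True
```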
Filesystem-specific behaviors and reliability
- ext4 has special logic to make common “rename for atomic replace” patterns safer.
- ZFS is discussed as robust but with Linux-specific issues under heavy load that may involve IO schedulers and external factors; there’s ongoing debugging.
- Some report more corruption with modern filesystems than with FAT; others stress that power loss and hardware flaws are fundamental and must be engineered around, not merely "fixed" operationally.
Meta observations
- Many note that filesystems and storage “mostly work” until rare, harsh failure conditions.
- There’s tension between accepting imperfect semantics for 95% of use cases and demanding stronger guarantees for critical systems.