TernFS – An exabyte scale, multi-region distributed filesystem

Small Files & Metadata Scaling

  • TernFS is explicitly not optimized for tiny files; median file size in production is ~2 MB.
  • Storing billions of 1 KB files is possible and safe, but leads to:
    • Poor space efficiency.
    • Potential exhaustion of metadata structures / inode-like limits.
  • Multiple commenters explain that moving from ~trillions to quadrillions of objects makes metadata itself petabyte-scale (see the back-of-the-envelope sketch after this list):
    • Bulk deletion and reindexing become extremely slow and non-local.
    • Cache and scheduler state can no longer fit in RAM; “meta-scheduling” becomes necessary.
    • Worst-case behavior and tail latency dominate the design.
  • General sentiment: designing for tiny files at exabyte scale is possible but requires exotic, complex architectures; avoiding that is a reasonable tradeoff.
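
A rough back-of-the-envelope sketch of the metadata problem (the per-entry size is an assumption, not a number from the thread): even a few hundred bytes of metadata per object adds up to hundreds of petabytes once the object count reaches a quadrillion.

```go
package main

import "fmt"

func main() {
    // Illustrative assumptions only: ~10^15 objects and ~256 bytes of
    // metadata per object (name, size, block locations, timestamps).
    // Real per-entry sizes will differ.
    const (
        objects       = 1e15
        bytesPerEntry = 256.0
    )

    totalBytes := objects * bytesPerEntry
    fmt.Printf("metadata alone: ~%.0f PB\n", totalBytes/1e15) // ~256 PB
}
```

At that size the metadata can no longer be treated as something that fits in one machine's RAM, which is why bulk deletes, reindexing, and cache/scheduler state become the dominant concerns.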

Comparison to CephFS and Other Systems

  • TernFS vs CephFS:
    • Ceph uses RADOS for both metadata and data; TernFS uses a specialized metadata DB and a separate block service, tuned for immutable files and low metadata churn.
    • TernFS currently runs as a single deployment storing ~600 PB without sub-clustering; the authors claim this exceeds commonly cited Ceph cluster sizes.
    • TernFS sacrifices mutability and POSIX permissions to gain scale and simplicity; CephFS is closer to full POSIX.
    • TernFS emphasizes seamless real-time multi-region replication; commenters say Ceph does not offer that in the same way.
  • Other systems mentioned for context: SeaweedFS (good with small files), Lustre, GPFS, Isilon, 3FS, HRT’s DFS, ZeroFS, CVMFS.
  • Some view Ceph as flexible but heavy, with significant overhead for mutable workloads and high complexity.

Design Choices & Performance

  • TernFS is append-only / immutable at its core, with Reed–Solomon erasure coding and replication.
  • Uses TCP/IP and a Go-based block server relying on sendfile; the authors note they can saturate NICs without RDMA (see the sketch after this list), though RDMA could be added later.
  • Consensus uses a custom “Raft-like” implementation (LogsDB); currently no automatic failover, to be enabled after Jepsen-style testing.
  • Metadata operations are sharded; most activity stays within a single shard. The lack of ACLs and the deliberately restricted semantics further reduce complexity.
  • Includes a Linux kernel module rather than FUSE; performance difference vs FUSE not quantified but implied to be significant.
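
A minimal sketch of the zero-copy serving pattern referenced above (this is not the TernFS block server; the port and block path are invented for illustration). On Linux, copying from an *os.File to a TCP connection with io.Copy lets the Go runtime use sendfile(2), so block bytes move from the page cache to the socket without passing through user space:

```go
package main

import (
    "io"
    "log"
    "net"
    "os"
)

// serve streams one file to the client and closes the connection.
func serve(conn net.Conn, path string) {
    defer conn.Close()
    f, err := os.Open(path)
    if err != nil {
        log.Print(err)
        return
    }
    defer f.Close()
    // io.Copy notices that a *net.TCPConn implements io.ReaderFrom; on Linux
    // the runtime then issues sendfile(2), avoiding a userspace copy.
    if _, err := io.Copy(conn, f); err != nil {
        log.Print(err)
    }
}

func main() {
    ln, err := net.Listen("tcp", ":9000") // hypothetical block-service port
    if err != nil {
        log.Fatal(err)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        // Hypothetical fixed block path, purely for illustration.
        go serve(c, "/srv/blocks/example.blk")
    }
}
```

For large immutable blocks this is why saturating a NIC without RDMA is plausible: the per-byte work happens in the kernel's TCP path rather than in the application.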

Scale, HFT Workload & Data Volume

  • The production deployment reportedly stores on the order of 500–600 PB for financial research.
  • Explanations for the data volume (see the back-of-the-envelope sketch after this list):
    • High-frequency trading data: order book changes can reach ~1M messages/sec per exchange.
    • Thousands of instruments plus derivatives multiply data streams.
    • Need for full historical order book data with fine granularity limits compression options.
  • Some question the social value of spending massive compute and storage on trading; others argue that liquidity provision and tighter spreads are economically valuable.
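
A hedged back-of-the-envelope for the data volume (only the ~1M messages/sec figure comes from the thread; the message size, trading-day length, and venue count below are assumptions):

```go
package main

import "fmt"

func main() {
    // Assumptions for illustration: treat the cited ~1M messages/sec peak as
    // if it were sustained, ~50 bytes stored per message, a 6.5-hour trading
    // day, ~50 venues, ~250 trading days per year.
    const (
        msgsPerSec  = 1_000_000.0
        bytesPerMsg = 50.0
        secsPerDay  = 6.5 * 3600
        venues      = 50.0
        daysPerYear = 250.0
    )

    perVenuePerDay := msgsPerSec * bytesPerMsg * secsPerDay
    perYearAllVenues := perVenuePerDay * venues * daysPerYear

    fmt.Printf("per venue per day:   ~%.1f TB\n", perVenuePerDay/1e12)   // ~1.2 TB
    fmt.Printf("all venues per year: ~%.1f PB\n", perYearAllVenues/1e15) // ~14.6 PB
}
```

Even if sustained rates are well below peak, years of history, more granular feeds, and derived research datasets push the total toward the reported hundreds of petabytes.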

Distributed Systems, CAP & Correctness

  • Commenters stress the difficulty of getting consensus and failure handling right at this scale.
  • CAP tradeoffs highlighted: during a partition a system must choose between consistency and availability; Paxos/Raft do not “evade” CAP, they choose consistency and give up availability on the minority side of a partition.
  • Some commenters are suspicious of distributed systems that don’t clearly state their consistency-vs-availability choice or expose request-level consistency knobs.

Licensing, Ecosystem & Broader Reactions

  • Core TernFS code is GPLv2-or-later; protocol definitions and client libraries are Apache 2.0 with LLVM exception to allow proprietary clients and kernel integration.
  • Multiple people praise the decision to open source such a high-value internal system; others note the strategic advantage of owning and deeply understanding your own DFS.
  • Some see TernFS as more like an object store with a filesystem veneer, optimized for a narrow but important workload.
  • Blockchain angle: a few suggest TernFS-like tech could underpin decentralized storage; others counter that immutability/decentralization don’t require blockchain and that blockchain-style metadata would be a performance bottleneck.
  • One commenter laments the focus on huge, ops-heavy DFS designs rather than “human-scale” distributed storage usable by individuals or small teams.