TernFS – An exabyte scale, multi-region distributed filesystem

Small Files & Metadata Scaling

  • TernFS is explicitly not optimized for tiny files; median file size in production is ~2 MB.
  • Storing billions of 1 KB files is possible and safe, but leads to:
    • Poor space efficiency.
    • Potential exhaustion of metadata structures / inode-like limits.
  • Multiple commenters explain that moving from ~trillions to quadrillions of objects makes metadata itself petabyte-scale (see the back-of-the-envelope sketch after this list):
    • Bulk deletion and reindexing become extremely slow and non-local.
    • Cache and scheduler state can no longer fit in RAM; “meta-scheduling” becomes necessary.
    • Worst-case behavior and tail latency dominate the design.
  • General sentiment: designing for tiny files at exabyte scale is possible but requires exotic, complex architectures; avoiding that is a reasonable tradeoff.
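
A rough back-of-the-envelope sketch of the metadata problem (the per-entry size is an assumption, not a number from the thread): even a few hundred bytes of metadata per object adds up to hundreds of petabytes once the object count reaches a quadrillion.

```go
package main

import "fmt"

func main() {
    // Illustrative assumptions only: ~10^15 objects and ~256 bytes of
    // metadata per object (name, size, block locations, timestamps).
    // Real per-entry sizes will differ.
    const (
        objects       = 1e15
        bytesPerEntry = 256.0
    )

    totalBytes := objects * bytesPerEntry
    fmt.Printf("metadata alone: ~%.0f PB\n", totalBytes/1e15) // ~256 PB
}
```

At that size the metadata can no longer be treated as something that fits in one machine's RAM, which is why bulk deletes, reindexing, and cache/scheduler state become the dominant concerns.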

Comparison to CephFS and Other Systems

  • TernFS vs CephFS:
    • Ceph uses RADOS for both metadata and data; TernFS uses a specialized metadata DB and a separate block service, tuned for immutable files and low metadata churn.
    • TernFS currently runs as a single deployment storing ~600 PB without sub-clustering; the authors claim this exceeds commonly cited Ceph cluster sizes.
    • TernFS sacrifices mutability and POSIX permissions to gain scale and simplicity; CephFS is closer to full POSIX.
    • TernFS emphasizes seamless real-time multi-region replication; commenters say Ceph does not offer that in the same way.
  • Other systems mentioned for context: SeaweedFS (good with small files), Lustre, GPFS, Isilon, 3FS, HRT’s DFS, ZeroFS, CVMFS.
  • Some view Ceph as flexible but heavy, with significant overhead for mutable workloads and high complexity.

Design Choices & Performance

  • TernFS is append-only / immutable at its core, with Reed–Solomon erasure coding and replication.
  • Uses TCP/IP and a Go-based block server relying on sendfile; the authors note they can saturate NICs without RDMA (see the sketch after this list), though RDMA could be added later.
  • Consensus uses a custom “Raft-like” implementation (LogsDB); currently no automatic failover, to be enabled after Jepsen-style testing.
  • Metadata operations are sharded; most activity stays within a single shard. The lack of ACLs and the deliberately restricted semantics further reduce complexity.
  • Includes a Linux kernel module rather than FUSE; performance difference vs FUSE not quantified but implied to be significant.
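
A minimal sketch of the zero-copy serving pattern referenced above (this is not the TernFS block server; the port and block path are invented for illustration). On Linux, copying from an *os.File to a TCP connection with io.Copy lets the Go runtime use sendfile(2), so block bytes move from the page cache to the socket without passing through user space:

```go
package main

import (
    "io"
    "log"
    "net"
    "os"
)

// serve streams one file to the client and closes the connection.
func serve(conn net.Conn, path string) {
    defer conn.Close()
    f, err := os.Open(path)
    if err != nil {
        log.Print(err)
        return
    }
    defer f.Close()
    // io.Copy notices that a *net.TCPConn implements io.ReaderFrom; on Linux
    // the runtime then issues sendfile(2), avoiding a userspace copy.
    if _, err := io.Copy(conn, f); err != nil {
        log.Print(err)
    }
}

func main() {
    ln, err := net.Listen("tcp", ":9000") // hypothetical block-service port
    if err != nil {
        log.Fatal(err)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        // Hypothetical fixed block path, purely for illustration.
        go serve(c, "/srv/blocks/example.blk")
    }
}
```

For large immutable blocks this is why saturating a NIC without RDMA is plausible: the per-byte work happens in the kernel's TCP path rather than in the application.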

Scale, HFT Workload & Data Volume

  • The production deployment reportedly stores on the order of 500–600 PB for financial research.
  • Explanations for the data volume (see the back-of-the-envelope sketch after this list):
    • High-frequency trading data: order book changes can reach ~1M messages/sec per exchange.
    • Thousands of instruments plus derivatives multiply data streams.
    • Need for full historical order book data with fine granularity limits compression options.
  • Some question the social value of spending massive compute and storage on trading; others argue that liquidity provision and tighter spreads are economically valuable.
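
A hedged back-of-the-envelope for the data volume (only the ~1M messages/sec figure comes from the thread; the message size, trading-day length, and venue count below are assumptions):

```go
package main

import "fmt"

func main() {
    // Assumptions for illustration: treat the cited ~1M messages/sec peak as
    // if it were sustained, ~50 bytes stored per message, a 6.5-hour trading
    // day, ~50 venues, ~250 trading days per year.
    const (
        msgsPerSec  = 1_000_000.0
        bytesPerMsg = 50.0
        secsPerDay  = 6.5 * 3600
        venues      = 50.0
        daysPerYear = 250.0
    )

    perVenuePerDay := msgsPerSec * bytesPerMsg * secsPerDay
    perYearAllVenues := perVenuePerDay * venues * daysPerYear

    fmt.Printf("per venue per day:   ~%.1f TB\n", perVenuePerDay/1e12)   // ~1.2 TB
    fmt.Printf("all venues per year: ~%.1f PB\n", perYearAllVenues/1e15) // ~14.6 PB
}
```

Even if sustained rates are well below peak, years of history, more granular feeds, and derived research datasets push the total toward the reported hundreds of petabytes.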

Distributed Systems, CAP & Correctness

  • Commenters stress the difficulty of getting consensus and failure handling right at this scale.
  • CAP tradeoffs highlighted: during a partition a system must choose between consistency and availability; Paxos/Raft do not “evade” CAP, they choose consistency and give up availability on the minority side of a partition.
  • Some commenters are suspicious of distributed systems that don’t clearly state their consistency-vs-availability choice or expose request-level consistency knobs.

Licensing, Ecosystem & Broader Reactions

  • Core TernFS code is GPLv2-or-later; protocol definitions and client libraries are Apache 2.0 with LLVM exception to allow proprietary clients and kernel integration.
  • Multiple people praise the decision to open source such a high-value internal system; others note the strategic advantage of owning and deeply understanding your own DFS.
  • Some see TernFS as more like an object store with a filesystem veneer, optimized for a narrow but important workload.
  • Blockchain angle: a few suggest TernFS-like tech could underpin decentralized storage; others counter that immutability/decentralization don’t require blockchain and that blockchain-style metadata would be a performance bottleneck.
  • One commenter laments the focus on huge, ops-heavy DFS designs rather than “human-scale” distributed storage usable by individuals or small teams.