An intro to DeepSeek's distributed file system

Workload and Motivation

  • 3FS is described as having originated in a high‑frequency trading context (2019) and later been repurposed for AI workloads.
  • Target workload: huge, mostly read‑heavy datasets at petabyte scale, many concurrent clients, and extremely high random‑read throughput.
  • Some suggest ML workloads only really need capacity, parallel reads, and redundancy, not strong consistency; others push back strongly, arguing that “consistency is hard, so skip it” tends to end badly at scale.

Architecture and Performance Characteristics

  • Architecturally, it’s a filesystem with scale‑out metadata (like Colossus, Tectonic, HopsFS, etc.), kept in a distributed DB (FoundationDB); a sketch of that pattern follows the list below.
  • Key points people highlight:
    • NVMe + RDMA, optimized for huge batched random reads from a small set of large files.
    • FUSE client for convenience, but with a hybrid mode: open via FUSE, then use a native library for the data path to avoid FUSE overhead.
    • Very high random‑read rates per node are reported (tens of GiB/s; at 4 KiB per read, 20 GiB/s works out to roughly 5.2 million IOPS); metadata ops (mdbench) are not especially stellar.
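
  As a rough illustration of the scale‑out‑metadata pattern, here is a minimal sketch of directory entries and inode records kept in FoundationDB via its Python bindings. The key layout, record fields, and helper names are illustrative assumptions, not 3FS’s actual schema.

      import fdb
      import json

      fdb.api_version(630)
      db = fdb.open()  # connects via the default cluster file

      # Assumed (illustrative) key layout, not 3FS's real schema:
      #   ('dirent', parent_ino, name) -> child inode number
      #   ('inode',  ino)              -> JSON-encoded attributes
      @fdb.transactional
      def create_file(tr, parent_ino, name, ino, attrs):
          # Both writes commit atomically, so a directory entry and its
          # inode record can never be observed half-created.
          tr[fdb.tuple.pack(('dirent', parent_ino, name))] = fdb.tuple.pack((ino,))
          tr[fdb.tuple.pack(('inode', ino))] = json.dumps(attrs).encode()

      @fdb.transactional
      def list_dir(tr, parent_ino):
          # Range-read everything under this directory's key prefix.
          prefix = fdb.tuple.pack(('dirent', parent_ino))
          return [(fdb.tuple.unpack(kv.key)[-1], fdb.tuple.unpack(kv.value)[0])
                  for kv in tr.get_range_startswith(prefix)]

      create_file(db, 1, 'model.ckpt', 42, {'size': 0, 'chunks': []})
      print(list_dir(db, 1))  # -> [('model.ckpt', 42)]

  The usual argument for this design is that metadata servers become stateless and can scale out, with the transactional store providing consistency; that is the property the thread contrasts with single‑node metadata designs.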

Comparisons to Other Systems

  • ZFS: acknowledged as not scale‑out; can grow storage on one node but not aggregate multiple machines for parallel IO.
  • CephFS: praised for real‑world PB‑scale deployments at large orgs; criticized as complex to run and relatively slow on modern NVMe without major tuning; some counter with recent Ceph benchmarks (TiB/s‑class aggregate throughput, multi‑million IOPS).
  • Alluxio, HopsFS, ObjectiveFS, JuiceFS, SeaweedFS:
    • Alluxio and others already offer FUSE access and tiered storage; 3FS’s differentiators are truly scale‑out metadata and an RDMA‑centric design.
    • JuiceFS/S3‑backed designs trade much higher latency for simplicity and cheap capacity.
    • SeaweedFS focuses on tiny objects with minimal metadata; 3FS targets huge files, chunked and read at very high random‑IO rates through a POSIX interface (a chunk‑addressing sketch follows this list).
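
  To make the chunking point concrete, here is a minimal sketch of how a byte offset in a large file might map to a chunk and a storage target. The chunk size, stripe width, and placement rule are illustrative assumptions, not 3FS’s actual layout.

      # Illustrative parameters, not 3FS's real values.
      CHUNK_SIZE = 4 * 2**20   # assume 4 MiB chunks
      STRIPE_WIDTH = 16        # assume round-robin striping over 16 targets

      def locate(offset: int) -> tuple[int, int, int]:
          """Map a file byte offset to (chunk index, offset in chunk, target)."""
          chunk = offset // CHUNK_SIZE
          within = offset % CHUNK_SIZE
          target = chunk % STRIPE_WIDTH  # assumed placement rule
          return chunk, within, target

      # A random 4 KiB read ~37 GiB into a file touches exactly one target:
      print(locate(40_000_000_000))  # -> (9536, 3117056, 0)

  Because independent chunks live on independent targets, many small random reads can proceed in parallel, which is the property the thread credits for 3FS’s random‑read throughput.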

Operations, Cost, and Cloud Constraints

  • Running your own 3FS‑style cluster on AWS vs. FSx for Lustre: a rough back‑of‑envelope estimate puts self‑hosting ~12–30% cheaper, but you then own all the operational complexity (a toy cost model follows this list).
  • Several comments note that any self‑run storage cluster (Ceph included) is an operational bear.
  • Complaints that public‑cloud NVMe throughput lags commodity on‑prem SSDs, which limits how reproducible 3FS‑like performance is in the cloud.
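
  As a toy version of that back‑of‑envelope comparison, the sketch below works out monthly cost per usable TiB under replication and compares it to a managed‑service price. Every number is a placeholder assumption chosen to show the structure of the estimate, not a quoted AWS rate.

      # All figures are placeholder assumptions, not quoted AWS prices.
      INSTANCE_MONTHLY = 6000.0  # assumed $/month for one NVMe-dense instance
      RAW_TIB_PER_NODE = 60.0    # assumed raw NVMe capacity per instance (TiB)
      REPLICATION = 3            # assumed replication factor
      MANAGED_PER_TIB = 350.0    # assumed managed-FS $/TiB-month equivalent

      def self_hosted_per_tib() -> float:
          """Hardware-only $/usable-TiB-month for a self-run cluster."""
          usable = RAW_TIB_PER_NODE / REPLICATION
          return INSTANCE_MONTHLY / usable

      diy = self_hosted_per_tib()
      savings = 1 - diy / MANAGED_PER_TIB
      print(f"self-hosted ${diy:.0f}/TiB-month; {savings:.0%} below managed")
      # Omits staffing, on-call, and upgrade costs, which the thread
      # argues dominate the real decision.

  With these particular placeholders the savings land around 14%, inside the ~12–30% range quoted above; the conclusion is sensitive to the replication factor and instance pricing assumed.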

Durability, Backup, and DR

  • Common pattern: rely on intra‑cluster replication for hardware failures; use snapshots and possibly separate datacenters for “fat‑finger” and disaster recovery.
  • The distinction between redundancy (for continuous operation) and backups (for rollback in time and catastrophic mistakes) is emphasized: replication faithfully propagates an accidental delete to every replica, so only point‑in‑time copies can undo it (a toy illustration follows this list).
  • Techniques mentioned: cross‑region mirroring, nearline/snapshot tiers, and traditional tape at hyperscaler scale.
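
  As a toy illustration of that redundancy‑vs‑backup distinction, the sketch below shows a fat‑fingered delete propagating to every replica while a snapshot preserves the old state. The class and methods are invented for illustration, not any real system’s API.

      # Toy model of why replication is not a backup (illustrative only).
      class ReplicatedStore:
          def __init__(self, replicas=3):
              self.replicas = [dict() for _ in range(replicas)]
              self.snapshots = []

          def put(self, key, value):
              for r in self.replicas:   # every write, good or bad,
                  r[key] = value        # reaches all replicas at once

          def delete(self, key):
              for r in self.replicas:   # ...including an accidental delete
                  r.pop(key, None)

          def snapshot(self):
              # Point-in-time copy, kept outside the replication path.
              self.snapshots.append(dict(self.replicas[0]))

      store = ReplicatedStore()
      store.put('weights', b'v1')
      store.snapshot()                # backup: frozen at this moment
      store.delete('weights')         # mistake propagates to all 3 replicas
      assert all('weights' not in r for r in store.replicas)
      assert 'weights' in store.snapshots[0]  # only the snapshot can restore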

Security and Backdoor Debate

  • One subthread questions the odds that 3FS is backdoored.
  • Responses split:
    • Some say it’s low‑probability, especially if deployed on isolated networks.
    • Others argue supply‑chain and vendor backdoors are a very real, historically demonstrated risk, and that treating them as “odd” concerns undermines serious threat modeling.
  • Discussion touches on nation‑state involvement and the need for defense‑in‑depth even for “internal” infrastructure.

Other Questions and Speculation

  • Comparisons for homelab / small‑scale setups (JuiceFS+S3, SeaweedFS) focus on latency vs simplicity rather than raw performance.
  • Open questions raised (but not fully resolved) about:
    • How capacity expansion is handled in practice.
    • What happens on metadata manager failures and what redundancy model is used.
  • One commenter wonders if this kind of FS makes large, CPU+NVMe‑based distributed LLM inference/training more viable, but no concrete performance analysis is provided.