An intro to DeepSeek's distributed file system
Workload and Motivation
- 3FS is described as born in a high‑frequency trading context (2019) and repurposed for AI workloads.
- Target workload: huge, mostly read‑heavy datasets, petabyte scale, many clients, extremely high random‑read throughput.
- Some suggest ML workloads only really need capacity, parallel reads, and redundancy, not strong consistency; others strongly push back that “consistency is hard, so skip it” tends to end badly at scale.
Architecture and Performance Characteristics
- Architecturally, it’s a scale‑out metadata filesystem (like Colossus, Tectonic, HopsFS, etc.) with metadata in a distributed DB (FoundationDB).
- Key points people highlight:
  - NVMe + RDMA, optimized for huge batched random reads from a small set of large files.
  - FUSE client for convenience, but with a hybrid mode: open via FUSE, then use a native library for the data path to avoid FUSE overhead.
  - Very high per‑node random‑read performance reported (tens of GiB/s of throughput, multi‑million 4 KiB IOPS); metadata ops (mdbench) not especially stellar.
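One way to picture the scale‑out‑metadata design: with metadata in a transactional KV store like FoundationDB, directory entries and inodes become key‑value pairs, and a path lookup is one KV read per component. The key schema below is purely illustrative (3FS's real layout is not described here), and a plain dict stands in for the distributed store.

```python
# Toy sketch of the "metadata in a distributed KV store" pattern used by
# 3FS (via FoundationDB) and systems like HopsFS and Tectonic.
# The key schema is hypothetical; a dict stands in for the KV store.
kv = {}

def mkdir(parent_ino, name, ino):
    kv[("dent", parent_ino, name)] = ino   # directory entry -> child inode
    kv[("inode", ino)] = {"type": "dir"}   # inode record -> attributes

def create(parent_ino, name, ino, chunks):
    kv[("dent", parent_ino, name)] = ino
    # A file inode also records which chunks hold its data and where
    # they live (hypothetical (node, chunk-index) placement).
    kv[("inode", ino)] = {"type": "file", "chunks": chunks}

def lookup(path):
    # Resolve one component at a time: one KV read per path element,
    # as a scale-out metadata service would.
    ino = 0  # root inode
    for name in path.strip("/").split("/"):
        ino = kv[("dent", ino, name)]
    return kv[("inode", ino)]

mkdir(0, "datasets", 1)
create(1, "shard-00001.bin", 2, chunks=[("node-a", 0), ("node-b", 1)])
print(lookup("/datasets/shard-00001.bin")["type"])  # -> file
```

Because the store itself is distributed and transactional, metadata capacity and throughput can grow with the store rather than being bottlenecked on a single metadata server.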
Comparisons to Other Systems
- ZFS: acknowledged as not scale‑out; can grow storage on one node but not aggregate multiple machines for parallel IO.
- CephFS: praised for real‑world PB‑scale and used at large orgs; criticized as complex to run and relatively slow on modern NVMe without major tuning; some counter with recent Ceph benchmarks (TiB/s, multi‑million IOPS).
- Alluxio, HopsFS, ObjectiveFS, JuiceFS, SeaweedFS:
  - Alluxio and others already offer FUSE and tiered storage; 3FS's differentiator is truly scale‑out metadata and an RDMA‑centric design.
  - JuiceFS and other S3‑backed designs trade much higher latency for simplicity and cheap capacity.
  - SeaweedFS focuses on tiny objects with minimal metadata; 3FS targets huge files, chunked and read at very high random‑IO rates with POSIX access.
Operations, Cost, and Cloud Constraints
- Running your own 3FS‑style cluster on AWS vs Amazon FSx for Lustre: a rough back‑of‑envelope estimate puts self‑hosting ~12–30% cheaper, but you then own all the operational complexity.
- Several comments note that any self‑run storage cluster (Ceph included) is an operational bear.
- Complaints that public‑cloud NVMe throughput lags commodity on‑prem SSDs, affecting how replicable 3FS‑like performance is in the cloud.
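The ~12–30% figure is just the shape of a simple ratio. Every dollar amount below is a placeholder invented for illustration; none of the numbers come from the discussion or from AWS pricing.

```python
# Hedged back-of-envelope: hypothetical monthly costs only, to show how
# a "self-run is ~X% cheaper than managed" estimate is typically formed.
fsx_lustre_monthly = 100_000.0   # placeholder managed-service bill
self_run_instances = 70_000.0    # placeholder NVMe instance fleet cost
self_run_ops_overhead = 15_000.0  # placeholder on-call/engineering cost

self_run_monthly = self_run_instances + self_run_ops_overhead
savings = 1 - self_run_monthly / fsx_lustre_monthly
print(f"{savings:.0%}")  # -> 15%, inside the ~12-30% range cited
```

The operational-overhead line is the contested term: comments in the thread argue it is easy to underestimate, which can erase the apparent savings.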
Durability, Backup, and DR
- Common pattern: rely on intra‑cluster replication for hardware failures; use snapshots and possibly separate datacenters for “fat‑finger” and disaster recovery.
- Distinction between redundancy (for continuous operation) and backups (for rollback in time and catastrophic mistakes) is emphasized.
- Techniques mentioned: cross‑region mirroring, nearline/snapshot tiers, and traditional tape at hyperscaler scale.
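The redundancy‑vs‑backup distinction can be made concrete with a toy sketch (not how 3FS or any real system implements either): replication mirrors every operation, including mistaken ones, while a snapshot freezes state at a point in time.

```python
# Toy model: two dicts stand in for a primary and its replica.
primary, replica = {}, {}

def write(key, value):
    # Replication mirrors every write to the replica...
    primary[key] = value
    replica[key] = value

def delete(key):
    # ...and mirrors every delete, including mistaken ones.
    primary.pop(key, None)
    replica.pop(key, None)

write("model.ckpt", "v1")
snapshot = dict(primary)   # point-in-time backup taken before the mistake
delete("model.ckpt")       # the "fat-finger" event
print("model.ckpt" in replica)   # False: replication gives no rollback
print(snapshot["model.ckpt"])    # v1: the snapshot does
```

This is why the thread pairs intra‑cluster replication (continuous operation through hardware failure) with snapshots or off‑site copies (rollback in time).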
Security and Backdoor Debate
- One subthread questions the odds that 3FS is backdoored.
- Responses split:
  - Some say it's low‑probability, especially if deployed on isolated networks.
  - Others argue supply‑chain and vendor backdoors are a real, historically demonstrated risk, and that dismissing them as unlikely undermines serious threat modeling.
- Discussion touches on nation‑state involvement and the need for defense‑in‑depth even for “internal” infrastructure.
Other Questions and Speculation
- Comparisons for homelab / small‑scale setups (JuiceFS+S3, SeaweedFS) focus on latency vs simplicity rather than raw performance.
- Open questions raised (but not fully resolved) about:
  - How capacity expansion is handled in practice.
  - What happens on metadata manager failures, and what redundancy model is used.
- One commenter wonders if this kind of FS makes large, CPU+NVMe‑based distributed LLM inference/training more viable, but no concrete performance analysis is provided.