An intro to DeepSeek's distributed file system
Workload and Motivation
- 3FS is described as born in a high‑frequency trading context (2019) and repurposed for AI workloads.
- Target workload: huge, mostly read‑heavy datasets, petabyte scale, many clients, extremely high random‑read throughput.
- Some suggest ML workloads only really need capacity, parallel reads, and redundancy, not strong consistency; others strongly push back that “consistency is hard, so skip it” tends to end badly at scale.
Architecture and Performance Characteristics
- Architecturally, it’s a scale‑out metadata filesystem (like Colossus, Tectonic, HopsFS, etc.) with metadata in a distributed DB (FoundationDB).
- Key points people highlight:
  - NVMe + RDMA, optimized for huge batched random reads from a small set of large files.
  - FUSE client for convenience, but with a hybrid mode: open via FUSE, then use a native library for the data path to avoid FUSE overhead.
  - Very high per‑node random‑read performance reported (tens of GiB/s of throughput, multi‑million 4 KiB IOPS); metadata ops (mdbench) not especially stellar.
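One way to picture the scale‑out‑metadata design: with metadata in a transactional KV store like FoundationDB, directory entries and inodes become key‑value pairs, and a path lookup is one KV read per component. The key schema below is purely illustrative (3FS's real layout is not described here), and a plain dict stands in for the distributed store.

```python
# Toy sketch of the "metadata in a distributed KV store" pattern used by
# 3FS (via FoundationDB) and systems like HopsFS and Tectonic.
# The key schema is hypothetical; a dict stands in for the KV store.
kv = {}

def mkdir(parent_ino, name, ino):
    kv[("dent", parent_ino, name)] = ino   # directory entry -> child inode
    kv[("inode", ino)] = {"type": "dir"}   # inode record -> attributes

def create(parent_ino, name, ino, chunks):
    kv[("dent", parent_ino, name)] = ino
    # A file inode also records which chunks hold its data and where
    # they live (hypothetical (node, chunk-index) placement).
    kv[("inode", ino)] = {"type": "file", "chunks": chunks}

def lookup(path):
    # Resolve one component at a time: one KV read per path element,
    # as a scale-out metadata service would.
    ino = 0  # root inode
    for name in path.strip("/").split("/"):
        ino = kv[("dent", ino, name)]
    return kv[("inode", ino)]

mkdir(0, "datasets", 1)
create(1, "shard-00001.bin", 2, chunks=[("node-a", 0), ("node-b", 1)])
print(lookup("/datasets/shard-00001.bin")["type"])  # -> file
```

Because the store itself is distributed and transactional, metadata capacity and throughput can grow with the store rather than being bottlenecked on a single metadata server.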
Comparisons to Other Systems
- ZFS: acknowledged as not scale‑out; can grow storage on one node but not aggregate multiple machines for parallel IO.
- CephFS: praised for real‑world PB‑scale and used at large orgs; criticized as complex to run and relatively slow on modern NVMe without major tuning; some counter with recent Ceph benchmarks (TiB/s, multi‑million IOPS).
- Alluxio, HopsFS, ObjectiveFS, JuiceFS, SeaweedFS:
  - Alluxio and others already offer FUSE and tiered storage; 3FS's differentiator is truly scale‑out metadata and an RDMA‑centric design.
  - JuiceFS and other S3‑backed designs trade much higher latency for simplicity and cheap capacity.
  - SeaweedFS focuses on tiny objects with minimal metadata; 3FS targets huge files, chunked and read at very high random‑IO rates with POSIX access.
Operations, Cost, and Cloud Constraints
- Running your own 3FS‑style cluster on AWS vs Amazon FSx for Lustre: a rough back‑of‑envelope estimate puts self‑hosting ~12–30% cheaper, but you then own all the operational complexity.
- Several comments note that any self‑run storage cluster (Ceph included) is an operational bear.
- Complaints that public‑cloud NVMe throughput lags commodity on‑prem SSDs, affecting how replicable 3FS‑like performance is in the cloud.
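The ~12–30% figure is just the shape of a simple ratio. Every dollar amount below is a placeholder invented for illustration; none of the numbers come from the discussion or from AWS pricing.

```python
# Hedged back-of-envelope: hypothetical monthly costs only, to show how
# a "self-run is ~X% cheaper than managed" estimate is typically formed.
fsx_lustre_monthly = 100_000.0   # placeholder managed-service bill
self_run_instances = 70_000.0    # placeholder NVMe instance fleet cost
self_run_ops_overhead = 15_000.0  # placeholder on-call/engineering cost

self_run_monthly = self_run_instances + self_run_ops_overhead
savings = 1 - self_run_monthly / fsx_lustre_monthly
print(f"{savings:.0%}")  # -> 15%, inside the ~12-30% range cited
```

The operational-overhead line is the contested term: comments in the thread argue it is easy to underestimate, which can erase the apparent savings.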
Durability, Backup, and DR
- Common pattern: rely on intra‑cluster replication for hardware failures; use snapshots and possibly separate datacenters for “fat‑finger” and disaster recovery.
- Distinction between redundancy (for continuous operation) and backups (for rollback in time and catastrophic mistakes) is emphasized.
- Techniques mentioned: cross‑region mirroring, nearline/snapshot tiers, and traditional tape at hyperscaler scale.
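The redundancy‑vs‑backup distinction can be made concrete with a toy sketch (not how 3FS or any real system implements either): replication mirrors every operation, including mistaken ones, while a snapshot freezes state at a point in time.

```python
# Toy model: two dicts stand in for a primary and its replica.
primary, replica = {}, {}

def write(key, value):
    # Replication mirrors every write to the replica...
    primary[key] = value
    replica[key] = value

def delete(key):
    # ...and mirrors every delete, including mistaken ones.
    primary.pop(key, None)
    replica.pop(key, None)

write("model.ckpt", "v1")
snapshot = dict(primary)   # point-in-time backup taken before the mistake
delete("model.ckpt")       # the "fat-finger" event
print("model.ckpt" in replica)   # False: replication gives no rollback
print(snapshot["model.ckpt"])    # v1: the snapshot does
```

This is why the thread pairs intra‑cluster replication (continuous operation through hardware failure) with snapshots or off‑site copies (rollback in time).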
Security and Backdoor Debate
- One subthread questions the odds that 3FS is backdoored.
- Responses split:
  - Some say it's low‑probability, especially if deployed on isolated networks.
  - Others argue supply‑chain and vendor backdoors are a real, historically demonstrated risk, and that dismissing them as unlikely undermines serious threat modeling.
- Discussion touches on nation‑state involvement and the need for defense‑in‑depth even for “internal” infrastructure.
Other Questions and Speculation
- Comparisons for homelab / small‑scale setups (JuiceFS+S3, SeaweedFS) focus on latency vs simplicity rather than raw performance.
- Open questions raised (but not fully resolved) about:
  - How capacity expansion is handled in practice.
  - What happens on metadata manager failures, and what redundancy model is used.
- One commenter wonders if this kind of FS makes large, CPU+NVMe‑based distributed LLM inference/training more viable, but no concrete performance analysis is provided.