Fire-Flyer File System (3FS)

Motivation and “NIH” (Not-Invented-Here) Debate

  • Several ask why build 3FS instead of using Ceph, MinIO, SeaweedFS, etc.
  • Defenders argue existing systems are “nowhere near” fast enough for their AI/HFT workloads, especially for huge random-read training jobs and large checkpointing.
  • Some note a broader pattern in China of big companies building full in-house infra stacks; by now these are often competitive.
  • A few say NIH can be rational if it boosts capability and morale, and point out all current tools started as NIH somewhere.

Performance and Comparisons

  • 3FS reports ~6.6 TiB/s aggregate read throughput across 180 nodes while serving training jobs.
  • A Ceph reference system reaches ~1 TiB/s on 68 nodes; commenters normalize both figures by each cluster’s theoretical hardware bandwidth and conclude 3FS sustains a larger fraction of its peak.
  • Others caution the comparison is apples-to-oranges: different hardware (network links, SSD counts), different workloads (live training traffic vs. synthetic random-read benchmarks), and different block sizes.
  • Parallel FS alternatives named as competitive in this range are Lustre and Weka, with Lustre described as very fast but operationally painful.
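The normalization commenters perform is simple per-node arithmetic on the two aggregate figures quoted above. A minimal sketch (the function name is ours; only the 6.6 TiB/s / 180-node and 1 TiB/s / 68-node figures come from the thread, and the true "fraction of peak" would further require each cluster's hardware specs, which are not given here):

```python
def per_node_gib_s(aggregate_tib_s: float, nodes: int) -> float:
    """Convert an aggregate TiB/s figure into per-node GiB/s (1 TiB = 1024 GiB)."""
    return aggregate_tib_s * 1024 / nodes

# Figures reported in the discussion:
fs3_per_node = per_node_gib_s(6.6, 180)   # ~37.5 GiB/s per node
ceph_per_node = per_node_gib_s(1.0, 68)   # ~15.1 GiB/s per node
print(f"3FS : {fs3_per_node:.1f} GiB/s per node")
print(f"Ceph: {ceph_per_node:.1f} GiB/s per node")
```

Per-node throughput alone still says nothing about efficiency; dividing each figure by that node type's theoretical link/SSD bandwidth is the step the commenters argue over.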

Design and Architecture

  • 3FS is described as specialized for AI training: massive, largely non-reusable random reads, a pattern for which the kernel page cache and readahead are counterproductive.
  • It uses Direct I/O, turns off the file cache, and handles alignment internally to avoid extra copies.
  • FUSE is used mainly for metadata; high-performance data paths require linking a C++ client (with Python bindings). Some call this “cheating” but clever.
  • Implementation relies on Linux AIO/io_uring; there is side discussion of upcoming FUSE-over-io_uring and uncached buffered I/O in newer kernels.
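The "handles alignment internally" point is worth making concrete: O_DIRECT requires offsets, lengths, and buffer addresses to be sector/page aligned, so a client library must round each request outward and slice the caller's bytes back out. A minimal Linux-only sketch (not 3FS code; `ALIGN`, `align_request`, and `direct_read` are hypothetical names, and 4096 is an assumed alignment):

```python
import os
import mmap

ALIGN = 4096  # assumed sector/page alignment required by O_DIRECT

def align_request(offset: int, length: int, align: int = ALIGN):
    """Round an arbitrary (offset, length) request out to aligned boundaries.
    Returns (aligned_offset, aligned_length, delta), where delta is where the
    caller's data begins inside the aligned buffer."""
    start = offset - offset % align
    end = offset + length
    aligned_len = ((end - start) + align - 1) // align * align
    return start, aligned_len, offset - start

def direct_read(path: str, offset: int, length: int) -> bytes:
    """Read via O_DIRECT while hiding the alignment fix-up from the caller,
    roughly the extra-copy-avoiding work a client library does internally."""
    start, aligned_len, delta = align_request(offset, length)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # bypass the page cache
    try:
        buf = mmap.mmap(-1, aligned_len)  # anonymous mappings are page-aligned
        n = os.preadv(fd, [buf], start)   # positioned read into aligned memory
        return bytes(buf[delta:delta + max(0, min(length, n - delta))])
    finally:
        os.close(fd)
```

A FUSE path cannot easily offer this, which is one reason the high-performance data path requires linking the native client.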

Data Access Patterns (Training and Inference)

  • Random access is justified to avoid models learning spurious sequence correlations; sequential passes risk overfitting to order.
  • Others push back, preferring pre-materialized shuffles despite storage overhead and debugging complexity.
  • Latency per read is seen as less critical than aggregate throughput; pipelines overlap I/O, host–device copies, and GPU compute.
  • 3FS is also mentioned as backing KV-cache storage for inference and RAG, explaining some cost advantages.
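The I/O-compute overlap in the latency bullet can be sketched as a tiny bounded prefetcher: a background thread produces batches (the "I/O" stage) while the consumer computes, so per-read latency hides behind aggregate throughput. A generic illustration, not 3FS code; `prefetch` and `depth` are hypothetical names:

```python
import queue
import threading

def prefetch(iterable, depth: int = 2):
    """Yield items from `iterable`, produced by a background thread through a
    bounded queue of `depth` slots -- so the producer (I/O) runs ahead of and
    overlaps with the consumer (compute)."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)          # blocks when `depth` items are already staged
        q.put(sentinel)          # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# Usage: wrap a (possibly shuffled) batch loader so reads overlap GPU steps.
batches = list(prefetch(range(5)))
print(batches)  # [0, 1, 2, 3, 4] -- order preserved, production overlapped
```

Real pipelines add a second overlap stage for host-to-device copies (e.g. pinned-memory staging), but the bounded-queue shape is the same.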

Broader Reflections

  • Commenters link the system’s sophistication to a long HFT heritage (code dating back to ~2019) and a culture of deep performance engineering.
  • There is meta-discussion about where such skills are cultivated, differences between Chinese and US corporate/academic pipelines, and whether Western firms have drifted away from this kind of infra craftsmanship.