Fire-Flyer File System (3FS)

Motivation and “NIH” (Not-Invented-Here) Debate

  • Several ask why build 3FS instead of using Ceph, MinIO, SeaweedFS, etc.
  • Defenders argue existing systems are “nowhere near” fast enough for their AI/HFT workloads, especially for huge random-read training jobs and large checkpointing.
  • Some note a broader pattern in China of big companies building full in-house infra stacks; by now these are often competitive.
  • A few say NIH can be rational if it boosts capability and morale, and point out all current tools started as NIH somewhere.

Performance and Comparisons

  • 3FS reports ~6.6 TiB/s aggregate read throughput across 180 nodes while serving training jobs.
  • A Ceph reference system reaches ~1 TiB/s on 68 nodes; commenters normalize both figures by each cluster’s theoretical hardware bandwidth and conclude 3FS sustains a larger fraction of its peak.
  • Others caution the comparison is apples-to-oranges: different hardware (network links, SSD counts), different workloads (live training traffic vs. synthetic random-read benchmarks), and different block sizes.
  • Parallel FS alternatives named as competitive in this range are Lustre and Weka, with Lustre described as very fast but operationally painful.
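The normalization commenters perform is simple per-node arithmetic on the two aggregate figures quoted above. A minimal sketch (the function name is ours; only the 6.6 TiB/s / 180-node and 1 TiB/s / 68-node figures come from the thread, and the true "fraction of peak" would further require each cluster's hardware specs, which are not given here):

```python
def per_node_gib_s(aggregate_tib_s: float, nodes: int) -> float:
    """Convert an aggregate TiB/s figure into per-node GiB/s (1 TiB = 1024 GiB)."""
    return aggregate_tib_s * 1024 / nodes

# Figures reported in the discussion:
fs3_per_node = per_node_gib_s(6.6, 180)   # ~37.5 GiB/s per node
ceph_per_node = per_node_gib_s(1.0, 68)   # ~15.1 GiB/s per node
print(f"3FS : {fs3_per_node:.1f} GiB/s per node")
print(f"Ceph: {ceph_per_node:.1f} GiB/s per node")
```

Per-node throughput alone still says nothing about efficiency; dividing each figure by that node type's theoretical link/SSD bandwidth is the step the commenters argue over.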

Design and Architecture

  • 3FS is described as specialized for AI training: massive, largely non-reusable random reads, a pattern for which the kernel page cache and readahead are counterproductive.
  • It uses Direct I/O, turns off the file cache, and handles alignment internally to avoid extra copies.
  • FUSE is used mainly for metadata; high-performance data paths require linking a C++ client (with Python bindings). Some call this “cheating” but clever.
  • Implementation relies on Linux AIO/io_uring; there is side discussion of upcoming FUSE-over-io_uring and uncached buffered I/O in newer kernels.
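The "handles alignment internally" point is worth making concrete: O_DIRECT requires offsets, lengths, and buffer addresses to be sector/page aligned, so a client library must round each request outward and slice the caller's bytes back out. A minimal Linux-only sketch (not 3FS code; `ALIGN`, `align_request`, and `direct_read` are hypothetical names, and 4096 is an assumed alignment):

```python
import os
import mmap

ALIGN = 4096  # assumed sector/page alignment required by O_DIRECT

def align_request(offset: int, length: int, align: int = ALIGN):
    """Round an arbitrary (offset, length) request out to aligned boundaries.
    Returns (aligned_offset, aligned_length, delta), where delta is where the
    caller's data begins inside the aligned buffer."""
    start = offset - offset % align
    end = offset + length
    aligned_len = ((end - start) + align - 1) // align * align
    return start, aligned_len, offset - start

def direct_read(path: str, offset: int, length: int) -> bytes:
    """Read via O_DIRECT while hiding the alignment fix-up from the caller,
    roughly the extra-copy-avoiding work a client library does internally."""
    start, aligned_len, delta = align_request(offset, length)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)  # bypass the page cache
    try:
        buf = mmap.mmap(-1, aligned_len)  # anonymous mappings are page-aligned
        n = os.preadv(fd, [buf], start)   # positioned read into aligned memory
        return bytes(buf[delta:delta + max(0, min(length, n - delta))])
    finally:
        os.close(fd)
```

A FUSE path cannot easily offer this, which is one reason the high-performance data path requires linking the native client.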

Data Access Patterns (Training and Inference)

  • Random access is justified to avoid models learning spurious sequence correlations; sequential passes risk overfitting to order.
  • Others push back, preferring pre-materialized shuffles despite storage overhead and debugging complexity.
  • Latency per read is seen as less critical than aggregate throughput; pipelines overlap I/O, host–device copies, and GPU compute.
  • 3FS is also mentioned as backing KV-cache storage for inference and RAG, explaining some cost advantages.
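The I/O-compute overlap in the latency bullet can be sketched as a tiny bounded prefetcher: a background thread produces batches (the "I/O" stage) while the consumer computes, so per-read latency hides behind aggregate throughput. A generic illustration, not 3FS code; `prefetch` and `depth` are hypothetical names:

```python
import queue
import threading

def prefetch(iterable, depth: int = 2):
    """Yield items from `iterable`, produced by a background thread through a
    bounded queue of `depth` slots -- so the producer (I/O) runs ahead of and
    overlaps with the consumer (compute)."""
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for item in iterable:
            q.put(item)          # blocks when `depth` items are already staged
        q.put(sentinel)          # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# Usage: wrap a (possibly shuffled) batch loader so reads overlap GPU steps.
batches = list(prefetch(range(5)))
print(batches)  # [0, 1, 2, 3, 4] -- order preserved, production overlapped
```

Real pipelines add a second overlap stage for host-to-device copies (e.g. pinned-memory staging), but the bounded-queue shape is the same.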

Broader Reflections

  • Commenters link the system’s sophistication to a long HFT heritage (code dating back to ~2019) and a culture of deep performance engineering.
  • There is meta-discussion about where such skills are cultivated, differences between Chinese and US corporate/academic pipelines, and whether Western firms have drifted away from this kind of infra craftsmanship.