GitHub - deepseek-ai/3FS: A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
deepseek-ai
2025.03.08
· GitHub · by Anonymous
#DistributedFileSystem #AITraining #InferenceWorkloads #HighPerformanceComputing #Storage

Key Points

  • The Fire-Flyer File System (3FS) is a high-performance distributed file system specifically engineered to address the demanding storage challenges of AI training and inference workloads.
  • It features a disaggregated architecture leveraging SSDs and RDMA, ensures strong consistency through Chain Replication with Apportioned Queries (CRAQ), and exposes familiar file interfaces backed by a transactional key-value store.
  • 3FS delivers exceptional performance, achieving 6.6 TiB/s peak read throughput on large clusters and efficiently supporting diverse use cases such as large-scale data sorting and KVCache optimization for LLM inference.

The Fire-Flyer File System (3FS) is a high-performance distributed file system developed by deepseek-ai, specifically engineered to meet the demanding requirements of AI training and inference workloads. It provides a shared storage layer by harnessing modern solid-state drives (SSDs) and Remote Direct Memory Access (RDMA) networks, simplifying the development of distributed applications.

The core methodology of 3FS is built upon a disaggregated architecture that combines the collective throughput of thousands of SSDs and the network bandwidth of hundreds of storage nodes. This design enables applications to access storage resources in a locality-oblivious manner, meaning the physical location of data does not inherently impact access performance for the application.

For strong consistency, 3FS implements Chain Replication with Apportioned Queries (CRAQ). CRAQ is a robust replication protocol that ensures data consistency across the distributed system, simplifying application logic by guaranteeing that reads always return the most recent committed write. The metadata services in 3FS are stateless and are backed by a transactional key-value store, such as FoundationDB, providing reliable and scalable metadata management while exposing a familiar file interface to users.
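To make the CRAQ read path concrete, here is a minimal Python sketch (not 3FS code; names and structure are illustrative): writes propagate dirty versions from head to tail, the tail commits and the ack travels back up the chain, and any replica may serve a read, deferring to the tail's committed version only when its local copy is dirty:

```python
class Node:
    def __init__(self):
        self.store = {}      # key -> {version: value}
        self.committed = {}  # key -> latest clean (committed) version

class Chain:
    """Toy CRAQ chain: writes flow head -> tail, commit acks flow tail -> head."""
    def __init__(self, n=3):
        self.nodes = [Node() for _ in range(n)]
        self.ver = 0

    def write(self, key, value):
        self.ver += 1
        # 1. Propagate the new version down the chain as "dirty".
        for node in self.nodes:
            node.store.setdefault(key, {})[self.ver] = value
        # 2. The tail commits; the ack travels back, marking the version clean.
        for node in reversed(self.nodes):
            node.committed[key] = self.ver

    def read(self, key, replica):
        """Apportioned queries: any replica can answer a read."""
        node = self.nodes[replica]
        latest = max(node.store[key])
        if latest == node.committed.get(key):
            return node.store[key][latest]   # clean: serve locally
        # Dirty: ask the tail which version is committed, then serve that.
        v = self.nodes[-1].committed[key]
        return node.store[key][v]

chain = Chain(3)
chain.write("x", "v1")
chain.write("x", "v2")
assert chain.read("x", 0) == "v2" and chain.read("x", 2) == "v2"
```

Because reads are spread across all replicas rather than funneled to the tail, read throughput scales with chain length while consistency is preserved.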

3FS is designed to support diverse AI workloads:

  • Data Preparation: It efficiently organizes outputs from data analytics pipelines into hierarchical directory structures and manages large volumes of intermediate data.
  • Dataloaders: It eliminates the need for explicit data prefetching or shuffling by supporting random access to training samples directly from compute nodes, facilitating on-the-fly data loading for large datasets.
  • Checkpointing: It enables high-throughput parallel checkpointing for large-scale AI model training, crucial for saving model states during long-running distributed training jobs.
  • KVCache for Inference: For AI inference, particularly large language models (LLMs), 3FS offers a cost-effective alternative to DRAM-based caching for Key-Value (KV) caches. It provides high throughput and significantly larger capacity for caching key and value vectors, avoiding redundant computations in decoder layers.
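The case for offloading KVCache from DRAM follows from its size. As a rough illustration (the model dimensions below are hypothetical, not taken from the source), the per-token cache footprint of a transformer decoder is:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per generated token: one key and one value
    vector stored per layer (fp16/bf16 assumed by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Hypothetical 70B-class model with grouped-query attention.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(per_token)  # 327680 bytes = 320 KiB per token
```

At that rate, long contexts accumulate tens of GiB of KV data per request, which is why an SSD-backed cache with high read throughput becomes an attractive alternative to DRAM.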

Performance evaluations highlight 3FS's capabilities:

  • Peak Throughput: In a read stress test, a cluster comprising 180 storage nodes (each with 2x200Gbps InfiniBand NICs and sixteen 14TiB NVMe SSDs) and over 500 client nodes (each with 1x200Gbps InfiniBand NIC) achieved an aggregate read throughput of approximately 6.6 TiB/s, even with concurrent background training traffic. Benchmarking uses a custom FIO engine built for the USRBIO API.
  • GraySort Benchmark: Evaluating sort performance on large-scale datasets, 3FS utilized a two-phase approach: data partitioning via shuffle using key prefix bits, followed by in-partition sorting. Both phases involve reading and writing data to 3FS. A test cluster with 25 storage nodes (each having 2 NUMA domains, 1 storage service per NUMA, and 2x400Gbps NICs) and 50 compute nodes (each with 2 NUMA domains, 192 physical cores, 2.2 TiB RAM, and 1x200 Gbps NIC) sorted 110.5 TiB of data across 8,192 partitions in 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.
  • KVCache Performance: For LLM inference KVCache, 3FS demonstrated peak read throughput of up to 40 GiB/s across clients equipped with 1x400Gbps NICs, while concurrently sustaining high IOPS for garbage collection (GC) of expired cache entries.
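The two-phase GraySort approach described above (shuffle records into partitions by key-prefix bits, then sort within each partition) can be sketched in a few lines of Python; the key width and partition count here are illustrative, not the benchmark's:

```python
import random

def graysort(records, key_bits=32, n_partitions=8):
    """Phase 1: shuffle records into partitions by the top bits of the key.
    Phase 2: sort each partition; their concatenation is globally sorted."""
    b = n_partitions.bit_length() - 1      # log2; n_partitions must be a power of two
    parts = [[] for _ in range(n_partitions)]
    for key in records:
        parts[key >> (key_bits - b)].append(key)   # phase 1: partition by prefix
    out = []
    for part in parts:
        out.extend(sorted(part))                   # phase 2: in-partition sort
    return out

keys = [random.getrandbits(32) for _ in range(100_000)]
assert graysort(keys) == sorted(keys)              # globally sorted
```

In the actual benchmark both phases read from and write to 3FS, so sort throughput is bounded by aggregate file-system bandwidth rather than by local disks.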

To set up 3FS, clone the GitHub repository and update its submodules. Dependencies include cmake, libuv, lz4, lzma, double-conversion, dwarf, unwind, aio, gflags, glog, gtest, gmock, clang toolchains (specific versions), gperftools, openssl, boost, build-essential, libfuse 3.16.1+, FoundationDB 7.1+, and a Rust toolchain (1.75.0 minimum, 1.85.0+ recommended).

The build process uses CMake, with one important consideration: the SHUFFLE_METHOD flag (-DSHUFFLE_METHOD=<method>). Because std::shuffle behavior historically differed across compilers, binaries compiled with different GCC versions (e.g., g++10 vs. g++11+) can be incompatible. Users must therefore specify SHUFFLE_METHOD explicitly to ensure a consistent algorithm is selected: match the existing cluster's configuration, or choose one for a fresh deployment and keep it for all future builds. Docker images are also provided to simplify the build environment on specific operating systems such as TencentOS-4 and OpenCloudOS-9.