GitHub - deepseek-ai/DeepEP: DeepEP: an efficient expert-parallel communication library
Key Points
- DeepEP is a communication library engineered for Mixture-of-Experts (MoE) and expert parallelism, delivering high-throughput, low-latency all-to-all GPU kernels for dispatch and combine operations, with support for low-precision FP8.
- The library introduces distinct "normal kernels" for training and prefilling, optimized for asymmetric bandwidth and SM control, and "low-latency kernels" for inference decoding, complemented by a novel hook-based communication-computation overlapping method that does not consume SM resources.
- DeepEP requires specific hardware such as Hopper GPUs, NVLink, and RDMA networks, provides detailed performance benchmarks for various configurations, and outlines future developments including zero-copy, eager protocols, and advanced overlap techniques.
DeepEP is a specialized communication library designed to optimize Mixture-of-Experts (MoE) and expert parallelism (EP) workloads, particularly for large language models like DeepSeek-V3. Its primary goal is to provide high-throughput and low-latency communication primitives, specifically for MoE dispatch and combine operations on GPUs.
The core methodology of DeepEP revolves around two main types of optimized kernels:
- Normal Kernels (for Training and Inference Prefilling): These kernels are engineered for high throughput and handle the "all-to-all" communication pattern inherent in MoE layers.
- Asymmetric-Domain Bandwidth Forwarding: To align with group-limited gating algorithms, DeepEP optimizes data forwarding between different memory domains, such as from NVLink (intra-node, high bandwidth) to RDMA (inter-node, potentially lower bandwidth). This is crucial for efficient data movement across GPUs within a node and between nodes.
- SM (Streaming Multiprocessor) Number Control: Users can explicitly control the number of SMs dedicated to communication kernels, allowing for fine-grained resource management and balancing communication with computation.
- Communication-Computation Overlapping: DeepEP introduces a hook-based method for overlapping communication with computation. This mechanism is designed not to occupy any SM resources during the communication phase, allowing compute kernels to fully utilize the SMs. For instance, in `dispatch_forward`, a `previous_event` can be passed to make the dispatch kernel dependent on a prior CUDA event, enabling overlap. Similarly, options for asynchronous completion and dedicated communication streams facilitate better overlap.
- Performance Characteristics: On H800 GPUs, intranode dispatch/combine achieves approximately 150-160 GB/s via NVLink for 8 experts. Internode operations leverage RDMA, showing throughputs around 40-60 GB/s for 16-64 experts. The library supports low-precision operations, including FP8 for dispatching and BF16 for combining.
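The all-to-all dispatch/combine pattern these kernels implement can be illustrated with a small pure-Python simulation. This is a sketch of the communication pattern only, not DeepEP's API: ranks are flattened away, tokens are plain floats, and all function names here are hypothetical.

```python
# Illustrative sketch of the MoE all-to-all dispatch/combine pattern.
# Not DeepEP's API: experts are simulated as list buckets, tokens as floats.

def dispatch(tokens, expert_ids, num_experts):
    """Route each token to its chosen expert's bucket (the 'all-to-all' send)."""
    buckets = [[] for _ in range(num_experts)]
    for idx, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append((idx, tok))  # remember origin for the combine step
    return buckets

def combine(buckets, num_tokens):
    """Gather expert outputs back to each token's original position."""
    out = [None] * num_tokens
    for bucket in buckets:
        for idx, tok in bucket:
            out[idx] = tok
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [2, 0, 2, 1]  # gating decision per token
buckets = dispatch(tokens, expert_ids, num_experts=3)
processed = [[(i, t * 10) for i, t in b] for b in buckets]  # "expert compute"
result = combine(processed, len(tokens))
print(result)  # [10.0, 20.0, 30.0, 40.0]
```

In the real library the buckets live in NVLink/RDMA buffers and the scatter/gather runs as GPU kernels, but the round trip — dispatch to experts, compute, combine back into token order — is the same.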
- Low-Latency Kernels (for Inference Decoding): These kernels are specifically designed for latency-sensitive inference decoding phases.
- Pure RDMA Communication: They primarily utilize pure RDMA to minimize communication delays, as latency is paramount in real-time inference.
- No SM Control API: Unlike normal kernels, there's no explicit SM control API for low-latency kernels, indicating their design prioritizes minimal SM footprint or background operation.
- CUDA Graph Compatibility: These kernels are compatible with CUDA graphs, enabling further latency reduction by pre-compiling kernel launches.
- Receiving Hook Interface: A unique feature is the `return_recv_hook` option in `low_latency_dispatch` and `low_latency_combine`. This hook allows RDMA network traffic to proceed in the background without occupying GPU SMs, enabling "double-batch overlapping": data for the next micro-batch is received while the current one is being processed. The actual tensor data only becomes valid once the returned hook is called.
- Performance Characteristics: For decoding with 128 tokens per batch and a hidden size of 7168, dispatch latency ranges from 77 us (8 experts) to 194 us (256 experts), with RDMA bandwidths from 98 GB/s down to 39 GB/s. Combine latencies are slightly higher, ranging from 114 us to 360 us.
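The control flow of the receiving-hook interface can be sketched in plain Python. This is a minimal simulation, not DeepEP's implementation: the "network" is a background thread and all names (`low_latency_dispatch_sim`, `recv_hook`) are hypothetical. The point is the shape of the pattern — start the receive, compute on the current micro-batch, then call the hook before touching the next batch's data.

```python
# Sketch of a "return_recv_hook"-style interface. DeepEP's real hook
# drives RDMA traffic without touching GPU SMs; here a daemon thread
# stands in for the network so the overlap pattern is runnable anywhere.
import threading

def low_latency_dispatch_sim(next_batch):
    """Start receiving the next micro-batch; return (buffer, hook)."""
    buffer = {}
    done = threading.Event()

    def _receive():
        buffer["tokens"] = list(next_batch)  # simulated network transfer
        done.set()

    threading.Thread(target=_receive, daemon=True).start()

    def recv_hook():
        done.wait()  # buffer contents are only valid after this returns
        return buffer["tokens"]

    return buffer, recv_hook

buffer, hook = low_latency_dispatch_sim([5, 6, 7])
current = sum(x * x for x in [1, 2, 3])  # overlapped compute on current batch
next_tokens = hook()                     # now the received data is valid
print(current, next_tokens)  # 14 [5, 6, 7]
```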
Underlying Technical Details and Optimizations:
- Communication Buffer Management: DeepEP uses a `Buffer` class to manage communication buffers. It dynamically allocates NVLink and RDMA buffer sizes based on dispatch/combine configurations and the hidden dimension size, using `get_dispatch_config`, `get_combine_config`, `get_nvl_buffer_size_hint`, and `get_rdma_buffer_size_hint`. For low-latency mode, `get_low_latency_rdma_size_hint` is used, which requires `num_qps_per_rank` to be set, ideally equal to the number of local experts.
- Undefined-Behavior PTX Usage: For extreme performance gains, DeepEP uses a non-standard PTX instruction pattern, `ld.global.nc.L1::no_allocate.L2::256B`, for reading volatile data. Although this usage is undefined behavior according to NVIDIA's PTX documentation, it has been empirically found to be correct and significantly faster on Hopper architectures (SM90), owing to the unified, non-coherent L1 cache and the strong `L1::no_allocate` modifier. A `DISABLE_AGGRESSIVE_PTX_INSTRS` flag is provided to disable this optimization if compatibility issues arise on other platforms.
- Network Configuration: The library is fully tested with InfiniBand and is theoretically compatible with RoCE. Traffic isolation via InfiniBand Virtual Lanes (VL), configured through the `NVSHMEM_IB_SL` environment variable, is recommended for segregating different kernel types. Adaptive routing is suggested for heavy network loads, while static routing is preferred for light loads. Congestion control is disabled by default.
- Backward Pass Implementation: The backward pass of MoE dispatch is implemented as a combine operation, and vice versa for MoE combine, giving gradient propagation a symmetrical communication pattern. For instance, `dispatch_backward` calls `_buffer.combine` and `combine_backward` calls `_buffer.dispatch`.
- Dependency on NVSHMEM: DeepEP relies on NVSHMEM for inter-GPU communication primitives, which must be installed beforehand and configured via `NVSHMEM_DIR`.
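The dispatch/combine symmetry in the backward pass follows from dispatch being a permutation of tokens and combine being its inverse, so each operation's gradient flows back through the other. This can be checked with a small sketch using hypothetical pure-Python stand-ins, not the `_buffer` API:

```python
# Sketch of why dispatch's backward is a combine (and vice versa):
# dispatch scatters tokens into expert order, combine is the inverse
# gather, so gradients propagate through the opposite operation.
# Hypothetical stand-ins, not DeepEP's _buffer API.

def dispatch(tokens, order):
    """Scatter tokens into expert order: out[order[i]] = tokens[i]."""
    out = [None] * len(tokens)
    for i, pos in enumerate(order):
        out[pos] = tokens[i]
    return out

def combine(tokens, order):
    """Inverse scatter: gather back into original token order."""
    return [tokens[pos] for pos in order]

order = [2, 0, 3, 1]          # token i is routed to slot order[i]
grads = [0.1, 0.2, 0.3, 0.4]  # upstream gradients
# Backward of dispatch is a combine with the same routing, and vice versa:
assert combine(dispatch(grads, order), order) == grads
assert dispatch(combine(grads, order), order) == grads
```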
The library supports Ampere (SM80) and Hopper (SM90) GPUs, Python 3.8+, CUDA 11.0+, and PyTorch 2.1+. It emphasizes the importance of auto-tuning on target clusters to achieve optimal performance, as default configurations are based on DeepSeek's internal setup. Ongoing developments include zero-copy optimizations, eager low-latency protocols, hybrid-EP backends with TMA instructions, and fine-grained communication-computation overlap for single-batch scenarios.