GitHub - deepseek-ai/DeepEP: DeepEP: an efficient expert-parallel communication library
Key Points
- DeepEP is a communication library engineered for Mixture-of-Experts (MoE) and expert parallelism, delivering high-throughput, low-latency all-to-all GPU kernels for dispatch and combine operations, with support for low-precision FP8.
- The library introduces distinct "normal kernels" for training and prefilling, optimized for asymmetric bandwidth and SM control, and "low-latency kernels" for inference decoding, complemented by a novel hook-based communication-computation overlapping method that does not consume SM resources.
- DeepEP requires specific hardware such as Hopper GPUs, NVLink, and RDMA networks, provides detailed performance benchmarks for various configurations, and outlines future developments including zero-copy, eager protocols, and advanced overlap techniques.
DeepEP is a specialized communication library designed to optimize Mixture-of-Experts (MoE) and expert parallelism (EP) workloads, particularly for large language models like DeepSeek-V3. Its primary goal is to provide high-throughput and low-latency communication primitives, specifically for MoE dispatch and combine operations on GPUs.
The core methodology of DeepEP revolves around two main types of optimized kernels:
- Normal Kernels (for Training and Inference Prefilling): These kernels are engineered for high throughput and handle the "all-to-all" communication pattern inherent in MoE layers.
- Asymmetric-Domain Bandwidth Forwarding: To align with group-limited gating algorithms, DeepEP optimizes data forwarding between different memory domains, such as from NVLink (intra-node, high bandwidth) to RDMA (inter-node, potentially lower bandwidth). This is crucial for efficient data movement across GPUs within a node and between nodes.
- SM (Streaming Multiprocessor) Number Control: Users can explicitly control the number of SMs dedicated to communication kernels, allowing for fine-grained resource management and balancing communication with computation.
- Communication-Computation Overlapping: DeepEP introduces a hook-based method for overlapping communication with computation. This mechanism is designed not to occupy any SM resources during the communication phase, allowing compute kernels to fully utilize the SMs. For instance, in `dispatch_forward`, a `previous_event` can be passed to make the dispatch kernel dependent on a prior CUDA event, enabling overlap. Similarly, options for asynchronous completion and dedicated communication streams facilitate better overlap.
- Performance Characteristics: On H800 GPUs, intranode dispatch/combine achieves approximately 150-160 GB/s via NVLink for 8 experts. Internode operations leverage RDMA, showing throughputs around 40-60 GB/s for 16-64 experts. The library supports low-precision operations, including FP8 for dispatching and BF16 for combining.
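The all-to-all dispatch/combine pattern these kernels implement can be illustrated with a small pure-Python simulation. This is a sketch of the communication pattern only, not DeepEP's API: ranks are flattened away, tokens are plain floats, and all function names here are hypothetical.

```python
# Illustrative sketch of the MoE all-to-all dispatch/combine pattern.
# Not DeepEP's API: experts are simulated as list buckets, tokens as floats.

def dispatch(tokens, expert_ids, num_experts):
    """Route each token to its chosen expert's bucket (the 'all-to-all' send)."""
    buckets = [[] for _ in range(num_experts)]
    for idx, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append((idx, tok))  # remember origin for the combine step
    return buckets

def combine(buckets, num_tokens):
    """Gather expert outputs back to each token's original position."""
    out = [None] * num_tokens
    for bucket in buckets:
        for idx, tok in bucket:
            out[idx] = tok
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [2, 0, 2, 1]  # gating decision per token
buckets = dispatch(tokens, expert_ids, num_experts=3)
processed = [[(i, t * 10) for i, t in b] for b in buckets]  # "expert compute"
result = combine(processed, len(tokens))
print(result)  # [10.0, 20.0, 30.0, 40.0]
```

In the real library the buckets live in NVLink/RDMA buffers and the scatter/gather runs as GPU kernels, but the round trip — dispatch to experts, compute, combine back into token order — is the same.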
- Low-Latency Kernels (for Inference Decoding): These kernels are specifically designed for latency-sensitive inference decoding phases.
- Pure RDMA Communication: They primarily utilize pure RDMA to minimize communication delays, as latency is paramount in real-time inference.
- No SM Control API: Unlike normal kernels, there's no explicit SM control API for low-latency kernels, indicating their design prioritizes minimal SM footprint or background operation.
- CUDA Graph Compatibility: These kernels are compatible with CUDA graphs, enabling further latency reduction by pre-compiling kernel launches.
- Receiving Hook Interface: A unique feature is the `return_recv_hook` option in `low_latency_dispatch` and `low_latency_combine`. This hook allows RDMA network traffic to proceed in the background without occupying GPU SMs, enabling "double-batch overlapping": data for the next micro-batch is received while the current one is being processed. The actual tensor data only becomes valid once the returned hook is called.
- Performance Characteristics: For decoding with 128 tokens per batch and a hidden size of 7168, dispatch latency ranges from 77 us (8 experts) to 194 us (256 experts), with RDMA bandwidths from 98 GB/s down to 39 GB/s. Combine latencies are slightly higher, ranging from 114 us to 360 us.
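The control flow of the receiving-hook interface can be sketched in plain Python. This is a minimal simulation, not DeepEP's implementation: the "network" is a background thread and all names (`low_latency_dispatch_sim`, `recv_hook`) are hypothetical. The point is the shape of the pattern — start the receive, compute on the current micro-batch, then call the hook before touching the next batch's data.

```python
# Sketch of a "return_recv_hook"-style interface. DeepEP's real hook
# drives RDMA traffic without touching GPU SMs; here a daemon thread
# stands in for the network so the overlap pattern is runnable anywhere.
import threading

def low_latency_dispatch_sim(next_batch):
    """Start receiving the next micro-batch; return (buffer, hook)."""
    buffer = {}
    done = threading.Event()

    def _receive():
        buffer["tokens"] = list(next_batch)  # simulated network transfer
        done.set()

    threading.Thread(target=_receive, daemon=True).start()

    def recv_hook():
        done.wait()  # buffer contents are only valid after this returns
        return buffer["tokens"]

    return buffer, recv_hook

buffer, hook = low_latency_dispatch_sim([5, 6, 7])
current = sum(x * x for x in [1, 2, 3])  # overlapped compute on current batch
next_tokens = hook()                     # now the received data is valid
print(current, next_tokens)  # 14 [5, 6, 7]
```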
Underlying Technical Details and Optimizations:
- Communication Buffer Management: DeepEP uses a `Buffer` class to manage communication buffers. It dynamically allocates NVLink and RDMA buffer sizes based on dispatch/combine configurations and the hidden dimension size, using `get_dispatch_config`, `get_combine_config`, `get_nvl_buffer_size_hint`, and `get_rdma_buffer_size_hint`. For low-latency mode, `get_low_latency_rdma_size_hint` is used, which requires `num_qps_per_rank` to be set, ideally equal to the number of local experts.
- Undefined-Behavior PTX Usage: For extreme performance gains, DeepEP uses a non-standard PTX instruction pattern, `ld.global.nc.L1::no_allocate.L2::256B`, for reading volatile data. Although this usage is undefined behavior according to NVIDIA's PTX documentation, it has been empirically found to be correct and significantly faster on Hopper architectures (SM90), owing to the unified, non-coherent L1 cache and the strong `L1::no_allocate` modifier. A `DISABLE_AGGRESSIVE_PTX_INSTRS` flag is provided to disable this optimization if compatibility issues arise on other platforms.
- Network Configuration: The library is fully tested with InfiniBand and is theoretically compatible with RoCE. Traffic isolation via InfiniBand Virtual Lanes (VL), configured through the `NVSHMEM_IB_SL` environment variable, is recommended for segregating different kernel types. Adaptive routing is suggested for heavy network loads, while static routing is preferred for light loads. Congestion control is disabled by default.
- Backward Pass Implementation: The backward pass of MoE dispatch is implemented as a combine operation, and vice versa for MoE combine, giving gradient propagation a symmetrical communication pattern. For instance, `dispatch_backward` calls `_buffer.combine` and `combine_backward` calls `_buffer.dispatch`.
- Dependency on NVSHMEM: DeepEP relies on NVSHMEM for inter-GPU communication primitives, which must be installed beforehand and configured via `NVSHMEM_DIR`.
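The dispatch/combine symmetry in the backward pass follows from dispatch being a permutation of tokens and combine being its inverse, so each operation's gradient flows back through the other. This can be checked with a small sketch using hypothetical pure-Python stand-ins, not the `_buffer` API:

```python
# Sketch of why dispatch's backward is a combine (and vice versa):
# dispatch scatters tokens into expert order, combine is the inverse
# gather, so gradients propagate through the opposite operation.
# Hypothetical stand-ins, not DeepEP's _buffer API.

def dispatch(tokens, order):
    """Scatter tokens into expert order: out[order[i]] = tokens[i]."""
    out = [None] * len(tokens)
    for i, pos in enumerate(order):
        out[pos] = tokens[i]
    return out

def combine(tokens, order):
    """Inverse scatter: gather back into original token order."""
    return [tokens[pos] for pos in order]

order = [2, 0, 3, 1]          # token i is routed to slot order[i]
grads = [0.1, 0.2, 0.3, 0.4]  # upstream gradients
# Backward of dispatch is a combine with the same routing, and vice versa:
assert combine(dispatch(grads, order), order) == grads
assert dispatch(combine(grads, order), order) == grads
```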
The library supports Ampere (SM80) and Hopper (SM90) GPUs, Python 3.8+, CUDA 11.0+, and PyTorch 2.1+. It emphasizes the importance of auto-tuning on target clusters to achieve optimal performance, as default configurations are based on DeepSeek's internal setup. Ongoing developments include zero-copy optimizations, eager low-latency protocols, hybrid-EP backends with TMA instructions, and fine-grained communication-computation overlap for single-batch scenarios.