GitHub - deepseek-ai/profile-data: Analyze computation-communication overlap in V3/R1.
Key Points
- DeepSeek-AI publicly shares profiling data from its training and inference framework, captured with the PyTorch Profiler, to demonstrate its communication-computation overlap strategies and low-level implementation details.
- The training profile illustrates the overlapping strategy for DualPipe forward and backward chunks, each containing four MoE layers, under an EP64/TP1 configuration; PP communication is excluded.
- The inference profiles for prefilling and decoding use two micro-batches to overlap computation with all-to-all communication; prefilling balances the attention load across the micro-batches, while decoding frees GPU SMs during all-to-all operations.
The DeepSeek-AI team publicly shares profiling data, captured with the PyTorch Profiler, from their training and inference framework, aiming to provide insight into their communication-computation overlap strategies and low-level implementation details. The traces can be visualized by opening chrome://tracing (or edge://tracing) in a browser. One simplifying assumption applies to all profiles: a perfectly balanced Mixture-of-Experts (MoE) routing strategy is simulated.
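Because the traces are plain Chrome-trace-format JSON, they can also be inspected programmatically rather than only in the browser. A minimal sketch of that idea (the event names and durations below are invented for illustration, not taken from the released profiles):

```python
import json

# Minimal Chrome-trace-format sample (names and timings invented; real
# PyTorch Profiler traces contain many more event fields and categories).
raw = '''{"traceEvents": [
  {"name": "attention",  "ph": "X", "ts": 0,   "dur": 120, "pid": 0, "tid": 0},
  {"name": "all_to_all", "ph": "X", "ts": 40,  "dur": 60,  "pid": 0, "tid": 1},
  {"name": "moe_mlp",    "ph": "X", "ts": 120, "dur": 200, "pid": 0, "tid": 0}
]}'''
trace = json.loads(raw)

def total_durations(trace_obj):
    """Sum the durations of 'X' (complete) events per name, in microseconds."""
    totals = {}
    for ev in trace_obj["traceEvents"]:
        if ev.get("ph") == "X":
            totals[ev["name"]] = totals.get(ev["name"], 0) + ev["dur"]
    return totals

print(total_durations(trace))
# {'attention': 120, 'all_to_all': 60, 'moe_mlp': 200}
```

For a downloaded trace file, replace the inline string with `json.load(open(path))`; the same per-name summation then gives a quick picture of how much time is spent in compute versus communication events.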
The Training Profile demonstrates the overlapping strategy for an individual pair of forward and backward chunks within the DualPipe architecture, where each chunk comprises four MoE layers. The parallel configuration matches the DeepSeek-V3 pretraining settings: Expert Parallelism (EP) of 64, Tensor Parallelism (TP) of 1, and a sequence length of 4K. For simplicity, Pipeline Parallelism (PP) communication is excluded from this profile.
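The payoff of hiding one chunk's all-to-all under the paired chunk's computation can be pictured with a toy cost model; the per-layer timings and the simple max(compute, comm) overlap model below are invented simplifications, not measurements from the profile:

```python
def chunk_pair_time(compute_ms, comm_ms, layers, overlap):
    """Idealized per-device time for one forward+backward chunk pair.

    compute_ms: combined fwd+bwd compute per MoE layer (invented)
    comm_ms:    all-to-all (dispatch + combine) per MoE layer (invented)
    overlap:    if True, one chunk's communication runs under the other
                chunk's computation, so each layer costs max(compute, comm);
                if False, every layer pays compute then communication.
    """
    per_layer = max(compute_ms, comm_ms) if overlap else compute_ms + comm_ms
    return layers * per_layer

# With the four MoE layers per chunk mentioned above (timings invented):
print(chunk_pair_time(3.0, 2.5, 4, overlap=False))  # 22.0
print(chunk_pair_time(3.0, 2.5, 4, overlap=True))   # 12.0
```

In this toy model, overlap reduces the chunk-pair time from compute + comm to whichever of the two is larger, which is the qualitative effect the DualPipe trace is meant to show.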
The Inference Prefilling Profile reflects DeepSeek V3/R1's actual online deployment settings: EP32 and TP1, a prompt length of 4K, and a batch size of 16K tokens per GPU. Computation and all-to-all communication are overlapped by interleaving two micro-batches, with the attention computation load balanced between them, which may require splitting a single prompt across the two micro-batches.
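One simple way to picture the balancing step is a greedy longest-first partition of prompts into two micro-batches under an assumed attention cost that grows quadratically with prompt length; both the cost model and the prompt lengths are invented, and this sketch omits the prompt-splitting step the profile describes:

```python
def balance_two_microbatches(prompt_lens):
    """Greedy longest-first assignment of prompts to two micro-batches,
    balancing an assumed quadratic attention cost (cost model invented).
    Returns (prompts_a, cost_a), (prompts_b, cost_b)."""
    batch_a, batch_b = [], []
    cost_a = cost_b = 0
    for n in sorted(prompt_lens, reverse=True):
        if cost_a <= cost_b:
            batch_a.append(n)
            cost_a += n * n
        else:
            batch_b.append(n)
            cost_b += n * n
    return (batch_a, cost_a), (batch_b, cost_b)

# Invented prompt lengths: one long prompt dominates the attention cost.
(a, ca), (b, cb) = balance_two_microbatches([4096, 2048, 2048, 1024, 512, 512])
print(a, b)  # [4096] [2048, 2048, 1024, 512, 512]
```

The example also illustrates why splitting a single prompt can be necessary: one 4K prompt alone can outweigh several shorter ones, so token-level splitting is the only way to equalize the two micro-batches exactly.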
The Inference Decoding Profile likewise matches the actual online deployment configuration: EP128 and TP1, a prompt length of 4K, and a batch size of 128 requests per GPU. As in prefilling, two micro-batches are used to overlap computation with all-to-all communication. The critical difference is that during decoding, all-to-all communication does not occupy GPU Streaming Multiprocessors (SMs): once the Remote Direct Memory Access (RDMA) messages are issued, all SMs are freed, and the system waits for the all-to-all to complete only after the computation has finished. Further details of the all-to-all implementation can be found in DeepEP.
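The issue-then-wait pattern described above can be mimicked on the host, with a background thread standing in for the SM-free RDMA transfer; everything here (the timings, the thread-pool stand-in, the function names) is illustrative and is not DeepEP's implementation:

```python
import concurrent.futures
import time

def rdma_all_to_all(duration_s):
    # Stand-in for an RDMA all-to-all: it runs off the main thread,
    # just as the real transfer runs without occupying GPU SMs.
    time.sleep(duration_s)

def decode_step(compute_s, comm_s):
    """Toy version of the decoding schedule: issue the all-to-all,
    immediately return to computation, and only wait for the
    communication once compute has finished."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        comm = pool.submit(rdma_all_to_all, comm_s)  # issue; "SMs" freed
        time.sleep(comm_s and compute_s)             # compute proceeds
        comm.result()                                # wait after compute

start = time.perf_counter()
decode_step(0.05, 0.04)
elapsed = time.perf_counter() - start
# Overlapped: ~max(50, 40) ms rather than the serial 90 ms.
print(f"step took ~{elapsed * 1000:.0f} ms")
```

Because the communication finishes under the computation, the step costs roughly max(compute, comm) instead of their sum, without ever borrowing SMs from the compute kernels.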