GitHub - deepseek-ai/DualPipe: A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

deepseek-ai
2025.03.08
#LLM #Pipeline Parallelism #Deep Learning #Distributed Training #Algorithm

Key Points

  • DualPipe is a bidirectional pipeline parallelism algorithm that achieves full overlap of forward and backward computation-communication phases, reducing pipeline bubbles in DeepSeek V3/R1 training.
  • DualPipeV is a concise, V-shaped schedule derived from DualPipe using a "cut-in-half" procedure, offering an efficient alternative.
  • Both methods aim to optimize large-scale model training by significantly reducing pipeline bubbles and improving resource utilization compared to prior pipeline parallelism techniques.

DualPipe is an innovative bidirectional pipeline parallelism algorithm, introduced in the DeepSeek-V3 Technical Report, designed to achieve full overlap of forward and backward computation-communication phases while simultaneously reducing pipeline bubbles in deep learning model training.

The core methodology of DualPipe revolves around a bidirectional execution schedule. Unlike traditional unidirectional pipeline parallelism (e.g., 1F1B), DualPipe processes micro-batches in both forward and reverse directions. This bidirectional approach enables a significant and novel computation-communication overlap: during execution, a forward computation chunk and a backward computation chunk, together with their respective communication for activation and gradient exchange, can mutually overlap. This is captured by the term F&B, denoting the execution time of two mutually overlapped forward and backward chunks, which are otherwise sequential in many common pipeline strategies. The visualization of DualPipe schedules demonstrates that "two cells enclosed by a shared black border have mutually overlapped computation and communication," indicating concurrent execution of these phases. This aggressive overlapping strategy is key to minimizing idle time (pipeline bubbles).
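As a toy illustration of why an overlapped F&B chunk is cheaper than running a forward and a backward chunk serially, consider a simple timing model (all numbers and function names below are made up for illustration, not part of the DualPipe implementation):

```python
# Toy timing model: when a forward chunk's communication can run
# concurrently with a backward chunk's computation (and vice versa),
# the pair costs roughly max(total_compute, total_comm) instead of
# the full serial sum.

def serial_time(f_comp, f_comm, b_comp, b_comm):
    """Forward and backward chunks run one after another, no overlap."""
    return f_comp + f_comm + b_comp + b_comm

def overlapped_time(f_comp, f_comm, b_comp, b_comm):
    """Idealized F&B chunk: all communication hides behind computation."""
    return max(f_comp + b_comp, f_comm + b_comm)

# Hypothetical chunk costs: 3 ms compute + 1 ms comm per direction.
print(serial_time(3.0, 1.0, 3.0, 1.0))      # 8.0 ms
print(overlapped_time(3.0, 1.0, 3.0, 1.0))  # 6.0 ms: comm fully hidden
```

In this idealized model, F&B = 6.0 ms is strictly less than F + B = 8.0 ms, which is exactly the savings the bubble formulas below exploit.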

DualPipeV is a more concise, V-shaped schedule derived from DualPipe. It is generated using a "cut-in-half" procedure, a technique that further optimizes the scheduling for efficiency. While maintaining the core overlap benefits of DualPipe, DualPipeV is particularly notable for its reduced activation memory footprint.

The performance of DualPipe and DualPipeV is rigorously compared against other pipeline parallelism methods like 1F1B and ZB1P (Zero-Bubble 1-Forward 1-Backward). The comparison is based on key metrics: pipeline bubble duration, parameter memory per device, activation memory per device, and the number of devices.
Let PP denote the number of pipeline stages (always an even number), F the execution time of a forward chunk, B the execution time of a full backward chunk, W the execution time of a "backward for weights" chunk, and F&B the execution time of two mutually overlapped forward and backward chunks.

The pipeline bubble durations are specified as follows:

  • 1F1B: (PP - 1)(F + B)
  • ZB1P: (PP - 1)(F + B - 2W)
  • DualPipe: (PP/2 - 1)(F&B + B - 3W)
  • DualPipeV: (PP/2 - 1)(F&B + B - 3W)
The significant reduction in bubbles for DualPipe and DualPipeV stems from the F&B term, representing the effective overlapping of forward and backward operations, and from the factor (PP/2 - 1) compared to (PP - 1) in 1F1B/ZB1P, reflecting the optimized bidirectional scheduling.
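The bubble formulas above can be checked numerically. The sketch below plugs in hypothetical chunk timings (the values for F, B, W, and F&B are made up for illustration; the symbols follow the definitions given earlier):

```python
def bubble_1f1b(pp, F, B):
    # 1F1B bubble: (PP - 1)(F + B)
    return (pp - 1) * (F + B)

def bubble_zb1p(pp, F, B, W):
    # ZB1P bubble: (PP - 1)(F + B - 2W)
    return (pp - 1) * (F + B - 2 * W)

def bubble_dualpipe(pp, FB, B, W):
    # DualPipe / DualPipeV bubble: (PP/2 - 1)(F&B + B - 3W)
    return (pp // 2 - 1) * (FB + B - 3 * W)

# Hypothetical timings: PP = 8 stages, F = 1.0, B = 2.0, W = 1.0,
# and F&B = 2.5 (less than F + B = 3.0 thanks to the overlap).
print(bubble_1f1b(8, 1.0, 2.0))           # 21.0
print(bubble_zb1p(8, 1.0, 2.0, 1.0))      # 7.0
print(bubble_dualpipe(8, 2.5, 2.0, 1.0))  # 4.5
```

Even with these toy numbers, both the halved (PP/2 - 1) factor and the overlapped F&B term visibly shrink the bubble relative to 1F1B and ZB1P.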

Regarding memory usage per device:

  • Parameter Per Device:
    • 1F1B and ZB1P: 1×
    • DualPipe and DualPipeV: 2×
  • Activation Per Device:
    • 1F1B and ZB1P: PP
    • DualPipe: PP + 1
    • DualPipeV: PP/2 + 1
Notably, DualPipe and DualPipeV hold two copies of the model parameters per device (hence 2×) to support the bidirectional schedule, while DualPipeV roughly halves activation memory per device (PP/2 + 1 versus PP + 1 for DualPipe), making it more memory-efficient for large models. All methods require PP devices.

For practical application, both DualPipe and DualPipeV necessitate the implementation of a custom overlapped_forward_backward method. This method is crucial for orchestrating the concurrent execution of forward and backward passes along with their associated communication, which is fundamental to the algorithms' efficiency gains.
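A minimal sketch of what such a hook might look like, assuming a simplified signature (the real interface is defined by the DualPipe repository; the chunk objects and argument names here are stand-ins). A naive fallback simply runs the two halves back-to-back, which is functionally correct but forfeits the F&B time savings an overlapped implementation would deliver:

```python
def overlapped_forward_backward(fwd_module, fwd_inputs, bwd_module, bwd_grads):
    """Run one chunk's forward pass and another chunk's backward pass.

    Naive fallback: sequential execution with no real overlap. A tuned
    implementation would instead interleave the forward chunk's kernels
    and communication with the backward chunk's, realizing the F&B term.
    """
    outputs = fwd_module.forward(fwd_inputs)      # forward chunk
    input_grads = bwd_module.backward(bwd_grads)  # backward chunk
    return outputs, input_grads

class ScaleChunk:
    """Toy pipeline stage: y = a * x, so dL/dx = a * dL/dy."""
    def __init__(self, a):
        self.a = a
    def forward(self, x):
        return self.a * x
    def backward(self, grad_out):
        return self.a * grad_out

# Forward micro-batch through one toy chunk, backward through another.
outs, grads = overlapped_forward_backward(ScaleChunk(2.0), 3.0,
                                          ScaleChunk(4.0), 1.0)
print(outs, grads)  # 6.0 4.0
```

The key design point is that the scheduler hands the user both a forward unit of work and a backward unit of work at once, leaving it to the hook to decide how aggressively to interleave their computation and communication.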