GitHub - deepseek-ai/EPLB: Expert Parallelism Load Balancer

deepseek-ai
2025.03.08
#LLM #MoE #LoadBalancing #DeepSeek #AI

Key Points

  1. The Expert Parallelism Load Balancer (EPLB) balances GPU loads in expert parallelism by employing a redundant experts strategy that duplicates heavy-loaded experts and heuristically packs them onto GPUs.
  2. It offers two policies: Hierarchical Load Balancing for prefilling, which balances expert groups across nodes and then replicates within each node, and Global Load Balancing for decoding, which replicates experts globally.
  3. Following DeepSeek-V3, EPLB also attempts to place experts of the same group on the same node to minimize inter-node data traffic, producing an expert replication and placement plan based on estimated expert loads.

The Expert Parallelism Load Balancer (EPLB) is an open-source algorithm designed to achieve load balancing across GPUs in systems utilizing expert parallelism (EP), as described in the DeepSeek-V3 paper. The fundamental problem it addresses is the variable load of different experts, which can lead to imbalanced GPU utilization if not managed effectively.

The core methodology of EPLB involves three primary strategies:

  1. Redundant Experts Strategy: Heavy-loaded experts are duplicated or "redundantly replicated" to distribute their computational burden across multiple computational units.
  2. Heuristic Packing: The replicated experts are then heuristically packed onto available GPUs to ensure an even distribution of workload.
  3. Group-Limited Expert Routing Consideration: For further optimization, EPLB attempts to place experts belonging to the same group on the same physical node whenever possible. This strategy is critical for minimizing inter-node data traffic, especially when using group-limited expert routing mechanisms.

The algorithm, implemented in eplb.py, computes an expert replication and placement plan based on estimated expert loads. Note that the exact method for predicting expert loads (e.g., a moving average of historical statistics) is outside the scope of EPLB itself.
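To make the redundant-experts and heuristic-packing ideas concrete, here is a minimal, illustrative sketch (not the actual eplb.py implementation): the heaviest logical experts are split into replicas until the replica budget is spent, and the resulting replicas are packed onto GPUs largest-first, each going to the currently least-loaded GPU. All function and variable names are hypothetical.

```python
import heapq

def replicate_heaviest(loads, num_replicas):
    """Split the heaviest logical experts until num_replicas physical experts exist.

    Each physical replica of expert i is assumed to carry loads[i] / replica_count.
    """
    counts = [1] * len(loads)
    # Max-heap keyed on per-replica load (negated for heapq's min-heap).
    heap = [(-loads[i], i) for i in range(len(loads))]
    heapq.heapify(heap)
    for _ in range(num_replicas - len(loads)):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-loads[i] / counts[i], i))
    return counts

def pack_to_gpus(loads, counts, num_gpus):
    """Greedy packing: place each replica on the currently least-loaded GPU."""
    replicas = []  # (per-replica load, logical expert id)
    for i, c in enumerate(counts):
        replicas += [(loads[i] / c, i)] * c
    replicas.sort(reverse=True)  # largest first (classic LPT heuristic)
    gpu_load = [0.0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for load, i in replicas:
        g = min(range(num_gpus), key=gpu_load.__getitem__)
        gpu_load[g] += load
        placement[g].append(i)
    return placement, gpu_load

# Toy example: 4 logical experts, 6 physical replicas, 2 GPUs.
counts = replicate_heaviest([90, 10, 40, 60], num_replicas=6)
placement, gpu_load = pack_to_gpus([90, 10, 40, 60], counts, num_gpus=2)
```

With these loads, experts 0 and 60 get a second replica, and the greedy pass ends with the two GPUs carrying nearly equal load; the real algorithm additionally respects group and node constraints.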

EPLB provides two distinct load balancing policies, chosen based on the deployment scenario:

  1. Hierarchical Load Balancing:
    • Applicability: This policy is activated when the total number of server nodes (num_nodes) evenly divides the total number of expert groups (num_groups). It is specifically designed to leverage the benefits of group-limited expert routing.
    • Methodology: The process is hierarchical and unfolds in three stages:
      • Stage 1: Expert Group to Node Packing: Expert groups are first evenly distributed and packed across the available server nodes. This step ensures an initial balancing of load at the node level, preventing any single node from becoming a bottleneck due to an uneven distribution of expert groups.
      • Stage 2: Intra-Node Expert Replication: Within each individual node, experts are replicated based on their estimated loads. This means that if an expert within a group is identified as heavy-loaded, it will be duplicated multiple times within that specific node's allocated resources.
      • Stage 3: Intra-Node Replicated Expert to GPU Packing: Finally, the replicated experts (generated in Stage 2) are packed onto the individual GPUs residing within that specific node. This ensures fine-grained load balancing across the GPUs associated with that node.
    • Use Case: This policy is typically suitable for the prefilling stage of large language models, where a smaller expert-parallel size is often employed, and leveraging expert group locality is beneficial.
  2. Global Load Balancing:
    • Applicability: This policy is used in all other scenarios where the conditions for hierarchical load balancing are not met (i.e., when num_nodes does not evenly divide num_groups).
    • Methodology: Unlike the hierarchical approach, this policy replicates experts globally, without regard to expert groups or node boundaries, and packs the specified total number of expert replicas (num_replicas) directly onto the individual GPUs across all nodes. This simplifies placement by treating all GPUs as a single pool.
    • Use Case: This policy is generally preferred for the decoding stage, which typically involves a larger expert-parallel size and may benefit from a more flexible, global distribution of experts.
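The policy choice and the first hierarchical stage can be sketched as follows. This is a hedged illustration under the assumptions stated in the text (hierarchical only when num_nodes evenly divides num_groups; groups spread evenly, heaviest-first, across nodes); the function names are hypothetical and not eplb.py's.

```python
def choose_policy(num_groups: int, num_nodes: int) -> str:
    # Hierarchical balancing applies only when groups divide evenly
    # across nodes; otherwise fall back to global balancing.
    return "hierarchical" if num_groups % num_nodes == 0 else "global"

def pack_groups_to_nodes(group_loads, num_nodes):
    """Stage 1 sketch: assign each expert group (heaviest first) to the
    least-loaded node, keeping the group count per node equal."""
    per_node = len(group_loads) // num_nodes
    order = sorted(range(len(group_loads)), key=lambda g: -group_loads[g])
    node_load = [0.0] * num_nodes
    node_groups = [[] for _ in range(num_nodes)]
    for g in order:
        # Only nodes with a free group slot are eligible.
        candidates = [n for n in range(num_nodes) if len(node_groups[n]) < per_node]
        n = min(candidates, key=lambda c: node_load[c])
        node_groups[n].append(g)
        node_load[n] += group_loads[g]
    return node_groups

# 4 expert groups over 2 nodes: hierarchical policy applies.
policy = choose_policy(num_groups=4, num_nodes=2)
nodes = pack_groups_to_nodes([30.0, 10.0, 50.0, 20.0], num_nodes=2)
```

Stages 2 and 3 would then replicate heavy experts within each node and pack the replicas onto that node's GPUs, analogous to the global packing sketch above but scoped per node.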

The main interface for the load balancer is the function eplb.rebalance_experts, which takes weight (estimated expert loads), num_replicas (total desired expert replicas), num_groups, num_nodes, and num_gpus as inputs. It returns phy2log, log2phy, and logcnt, representing the computed expert replication and placement plan.
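The relationship between the three outputs can be illustrated with a toy mapping. The values below are made up for illustration, not produced by eplb; the interpretation (phy2log maps physical expert slots to logical expert ids, log2phy the reverse, logcnt the replica counts) follows the description above.

```python
# phy2log[i] = logical expert id served by physical expert slot i.
phy2log = [0, 1, 2, 2, 3, 0]   # 6 physical slots, 4 logical experts

# logcnt[e] = number of physical replicas of logical expert e.
logcnt = [phy2log.count(e) for e in range(max(phy2log) + 1)]

# log2phy[e] = physical slots hosting logical expert e.
log2phy = {e: [i for i, l in enumerate(phy2log) if l == e]
           for e in range(len(logcnt))}
```

Here experts 0 and 2 each get two replicas (logcnt = [2, 1, 2, 1]), so their load is split across two slots, while experts 1 and 3 keep a single slot each.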