Qwen 3: Practical Strategies for MoE Serving Optimization

2025.05.18 · Web · by Anonymous
#LLM #MoE #Qwen3 #Pruning #Optimization

Key Points

  1. This paper identifies router bias in Qwen3 Mixture-of-Experts (MoE) models, where expert activation is uneven (e.g., "Sparse Utilization" for Korean processing), and finds that simple frequency-based pruning degrades output quality due to the complementary nature of experts.
  2. Analysis via forward hooks and MLX patching reveals that some experts are heavily utilized while many others are underutilized, yet all contribute to performance, making it crucial to go beyond mere activation frequency when assessing expert importance.
  3. Sionic AI proposes an MoE Upscaling strategy involving more sophisticated pruning criteria, Post-Training to stabilize the modified model, and an increase in the number of active experts per token (`k`), leveraging the efficiency gains from pruning to improve performance and stability on complex inputs.

The paper discusses the Mixture-of-Experts (MoE) architecture, particularly as implemented in Alibaba Cloud's Qwen3 series, and addresses the challenge of "router bias" within these models.

The MoE architecture employs multiple smaller sub-networks, known as "Experts," instead of a single monolithic network. A "Router" selectively activates specific experts based on input data characteristics. This approach aims to enhance computational efficiency while maintaining or improving performance by only engaging necessary computational resources. In Qwen3, each expert is designed to handle specialized processing for different input types or tasks, effectively scaling the model's total parameters while keeping inference costs low because only a subset of experts is active per token.

The Router's primary function is a gating mechanism: for each input token, it evaluates all experts' suitability, computes activation probabilities, and selects the k experts with the highest probabilities (top-k selection). These selected experts process the token, and their outputs are combined via a weighted sum using the router's calculated gating probabilities to produce the final output. The quality of the router's selection significantly impacts model performance.
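The gating mechanism described above can be sketched as follows. This is an illustrative NumPy sketch, not Qwen3's actual implementation: the dimensions, the toy expert functions, and the `route_token` helper are all hypothetical (Qwen3 additionally renormalizes the top-k probabilities, which is omitted here):

```python
import numpy as np

def route_token(hidden, gate_weights, experts, k=2):
    """Select the k highest-probability experts and combine their outputs."""
    logits = hidden @ gate_weights                 # one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over experts
    top_k = np.argsort(probs)[-k:]                 # indices of the k best experts
    # Weighted sum of the selected experts' outputs, using gating probabilities
    out = sum(probs[i] * experts[i](hidden) for i in top_k)
    return out, top_k

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# Toy "experts": each is just a random linear map
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n_experts)]
hidden = rng.standard_normal(d)
gate = rng.standard_normal((d, n_experts))
out, chosen = route_token(hidden, gate, experts, k=2)
```

Only `k` of the `n_experts` expert functions are ever evaluated for a given token, which is the source of MoE's inference-cost savings.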

"Router bias" refers to the phenomenon where certain experts are disproportionately activated (either too frequently or too infrequently) compared to others. This imbalance degrades efficiency by concentrating computational load on a few experts while underutilizing others, potentially leading to slower processing and resource waste. It can also cause overfitting to specific tasks or data characteristics and undermine the MoE advantage of diverse expert contributions.

Router bias can be analyzed using methods like forward hooks or by patching model methods. The forward-hook method registers a function that records expert selection indices in real time each time an expert is chosen. While effective, it can incur performance overhead in GPU environments. An alternative, demonstrated with the MLX framework on macOS's unified-memory architecture, patches the `__call__` method of the MoE block. This intercepts and logs the expert indices (`inds`) chosen by the router during the forward pass, accumulating hit counts for each expert.
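The patching idea can be shown in a minimal, framework-free sketch: wrap the block's call so every routed expert index is tallied. `MoESparseBlock` and its deterministic toy router are stand-ins for illustration, not MLX's or Qwen3's actual modules:

```python
from collections import Counter

class MoESparseBlock:
    """Stand-in MoE block with a toy router (not a real model module)."""
    def __init__(self, n_experts=8, k=2):
        self.n_experts, self.k = n_experts, k

    def __call__(self, token_id):
        # Toy router: deterministic expert indices derived from the token id
        return [(token_id + i) % self.n_experts for i in range(self.k)]

expert_hits = Counter()
_original_call = MoESparseBlock.__call__

def _patched_call(self, token_id):
    inds = _original_call(self, token_id)
    expert_hits.update(inds)              # accumulate a hit count per expert
    return inds

MoESparseBlock.__call__ = _patched_call   # patch at the class level

block = MoESparseBlock()
for t in range(100):
    block(t)                              # every call now logs its experts
```

Patching at the class level (rather than on an instance) matters in Python, because `block(t)` dispatches through `type(block).__call__`; the same pattern applies when patching an MLX module's `__call__`.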

A case study on Korean language processing with Qwen3 MoE revealed a significant "Sparse Utilization" phenomenon. Analysis showed that a small number of experts handle a disproportionately large share of the processing. For example, Expert 7 exhibited the highest Exponential Moving Average (EMA) activation rate at approximately 0.42%, followed by Expert 75 (0.31%), and Experts 20, 1, and 101 (each at 0.28%). The top 20 experts accounted for a substantial portion of overall expert utilization, indicating an over-reliance on a few dominant experts. Conversely, a considerable number of experts (around 15 or more) showed very low EMA (below 0.05%), indicating minimal contribution. This concentration of workload suggests that pruning underutilized or redundant experts could reduce computational load by approximately 30% and optimize GPU VRAM usage without compromising performance.
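One plausible way to produce per-expert rates like those cited above is an exponential-moving-average tracker over routing decisions. The decay constant and the simulated router bias below are assumptions for illustration, not values from the paper:

```python
import random

def update_ema(ema, chosen, decay=0.999):
    """EMA of a 0/1 'was this expert routed to' signal, per expert."""
    for e in range(len(ema)):
        hit = 1.0 if e in chosen else 0.0
        ema[e] = decay * ema[e] + (1.0 - decay) * hit
    return ema

n_experts = 128
ema = [0.0] * n_experts
random.seed(0)
for _ in range(10_000):
    # Simulate a biased router that sends ~40% of tokens to expert 7
    chosen = {7} if random.random() < 0.4 else {random.randrange(n_experts)}
    ema = update_ema(ema, chosen)

dominant = max(range(n_experts), key=lambda e: ema[e])
```

The EMA converges toward each expert's long-run activation rate while remaining responsive to recent traffic, which makes the dominant and near-idle experts easy to rank.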

However, the research team at Sionic AI found that merely pruning experts based on activation frequency can degrade output quality. For instance, retaining only the top 64 most-activated experts in the Qwen3-235B-A22B model resulted in significant quality degradation, including repetitive outputs. This indicates that high activation frequency does not necessarily correlate with expert importance: a complementary relationship exists among experts, and even rarely chosen experts contribute critically to overall quality. Outside the top and bottom 10%, expert contribution tends to decrease almost linearly across the intermediate range, suggesting that the experts operate interdependently rather than in isolation.

Sionic AI proposes an "MoE Upscaling" approach that combines selective pruning strategies going beyond simple activation frequency with Post-Training: after pruning, the model undergoes additional training to adapt to the modified expert structure and ensure stable outputs. A key strategy is increasing the number of active experts per token (k). For example, the k value in the Qwen3-30B-A3B model was increased from 8 to 16. This is feasible because pruning reduces the total expert parameters held in memory, allowing more experts to be activated simultaneously without a prohibitive increase in serving cost. The increased k value, reflected in the `"num_experts_per_tok": 16` setting, improves the model's performance and stability on complex inputs by letting it draw on a wider range of the retained experts.
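The trade-off behind the k increase can be sketched with back-of-the-envelope arithmetic. The expert counts and the per-expert parameter size below are illustrative placeholders, not the model's published figures:

```python
def expert_layer_stats(n_experts, k, params_per_expert):
    """Return (total expert params held in memory, params active per token)."""
    return n_experts * params_per_expert, k * params_per_expert

p = 25_000_000                                               # hypothetical expert size
total_before, active_before = expert_layer_stats(128, 8, p)  # original model, k=8
total_after, active_after = expert_layer_stats(64, 16, p)    # pruned model, k=16
# Pruning halves the parameters resident in memory, while doubling k doubles
# the per-token expert compute, spending part of the pruning savings on quality.
```

In other words, the memory freed by removing experts is what creates headroom to route each token through more of the experts that remain.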

Finally, the paper mentions the application of Group Relative Policy Optimization (GRPO) to evaluate and optimize experts in groups rather than individually. This involves classifying expert groups (e.g., "keep candidates," "pruning candidates") and training routing policies based on their relative performance differences. Sionic AI also researches continuous learning methodologies for domain-specific model construction.