Qwen 3: Practical Strategies for MoE Serving Optimization

2025.05.18 · Web · by Anonymous
#LLM #MoE #Qwen3 #Pruning #Optimization

Key Points

  1. This paper identifies router bias in Qwen3 Mixture-of-Experts (MoE) models, where expert activation is uneven (e.g., "Sparse Utilization" for Korean processing), and finds that simple frequency-based pruning degrades output quality due to the complementary nature of experts.
  2. Analysis via forward hooks and MLX patching reveals that some experts are heavily utilized while many others are underutilized, yet all contribute to performance, making it crucial to go beyond mere activation frequency when assessing expert importance.
  3. Sionic AI proposes an MoE Upscaling strategy involving more sophisticated pruning criteria, Post-Training to stabilize the modified model, and an increase in the number of active experts per token (`k`), leveraging the efficiency gains from pruning to improve performance and stability on complex inputs.

The paper discusses the Mixture-of-Experts (MoE) architecture, particularly as implemented in Alibaba Cloud's Qwen3 series, and addresses the challenge of "router bias" within these models.

The MoE architecture employs multiple smaller sub-networks, known as "Experts," instead of a single monolithic network. A "Router" selectively activates specific experts based on input data characteristics. This approach aims to enhance computational efficiency while maintaining or improving performance by only engaging necessary computational resources. In Qwen3, each expert is designed to handle specialized processing for different input types or tasks, effectively scaling the model's total parameters while keeping inference costs low because only a subset of experts is active per token.

The Router's primary function is a gating mechanism: for each input token, it evaluates all experts' suitability, computes activation probabilities, and selects the k experts with the highest probabilities (top-k selection). These selected experts process the token, and their outputs are combined via a weighted sum using the router's calculated gating probabilities to produce the final output. The quality of the router's selection significantly impacts model performance.
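The gating mechanism described above can be sketched as follows. This is an illustrative NumPy sketch, not Qwen3's actual implementation: the dimensions, the toy expert functions, and the `route_token` helper are all hypothetical (Qwen3 additionally renormalizes the top-k probabilities, which is omitted here):

```python
import numpy as np

def route_token(hidden, gate_weights, experts, k=2):
    """Select the k highest-probability experts and combine their outputs."""
    logits = hidden @ gate_weights                 # one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over experts
    top_k = np.argsort(probs)[-k:]                 # indices of the k best experts
    # Weighted sum of the selected experts' outputs, using gating probabilities
    out = sum(probs[i] * experts[i](hidden) for i in top_k)
    return out, top_k

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# Toy "experts": each is just a random linear map
experts = [lambda x, W=rng.standard_normal((d, d)): x @ W for _ in range(n_experts)]
hidden = rng.standard_normal(d)
gate = rng.standard_normal((d, n_experts))
out, chosen = route_token(hidden, gate, experts, k=2)
```

Only `k` of the `n_experts` expert functions are ever evaluated for a given token, which is the source of MoE's inference-cost savings.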

"Router bias" refers to the phenomenon where certain experts are disproportionately activated (either too frequently or too infrequently) compared to others. This imbalance degrades efficiency by concentrating computational load on a few experts while underutilizing others, potentially leading to slower processing and resource waste. It can also cause overfitting to specific tasks or data characteristics and undermine the MoE advantage of diverse expert contributions.

Router bias can be analyzed using methods like forward hooks or by patching model methods. The forward-hook method registers a function that records expert selection indices in real time each time an expert is chosen. While effective, it can incur performance overhead in GPU environments. An alternative, demonstrated with the MLX framework on macOS's unified-memory architecture, patches the `__call__` method of the MoE block. This intercepts and logs the expert indices (`inds`) chosen by the router during the forward pass, accumulating hit counts for each expert.
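The patching idea can be shown in a minimal, framework-free sketch: wrap the block's call so every routed expert index is tallied. `MoESparseBlock` and its deterministic toy router are stand-ins for illustration, not MLX's or Qwen3's actual modules:

```python
from collections import Counter

class MoESparseBlock:
    """Stand-in MoE block with a toy router (not a real model module)."""
    def __init__(self, n_experts=8, k=2):
        self.n_experts, self.k = n_experts, k

    def __call__(self, token_id):
        # Toy router: deterministic expert indices derived from the token id
        return [(token_id + i) % self.n_experts for i in range(self.k)]

expert_hits = Counter()
_original_call = MoESparseBlock.__call__

def _patched_call(self, token_id):
    inds = _original_call(self, token_id)
    expert_hits.update(inds)              # accumulate a hit count per expert
    return inds

MoESparseBlock.__call__ = _patched_call   # patch at the class level

block = MoESparseBlock()
for t in range(100):
    block(t)                              # every call now logs its experts
```

Patching at the class level (rather than on an instance) matters in Python, because `block(t)` dispatches through `type(block).__call__`; the same pattern applies when patching an MLX module's `__call__`.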

A case study on Korean language processing with Qwen3 MoE revealed a significant "Sparse Utilization" phenomenon. Analysis showed that a small number of experts handle a disproportionately large share of the processing. For example, Expert 7 exhibited the highest Exponential Moving Average (EMA) activation rate at approximately 0.42%, followed by Expert 75 (0.31%), and Experts 20, 1, and 101 (each at 0.28%). The top 20 experts accounted for a substantial portion of overall expert utilization, indicating an over-reliance on a few dominant experts. Conversely, a considerable number of experts (around 15 or more) showed very low EMA (below 0.05%), indicating minimal contribution. This concentration of workload suggests that pruning underutilized or redundant experts could reduce computational load by approximately 30% and optimize GPU VRAM usage without compromising performance.
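One plausible way to produce per-expert rates like those cited above is an exponential-moving-average tracker over routing decisions. The decay constant and the simulated router bias below are assumptions for illustration, not values from the paper:

```python
import random

def update_ema(ema, chosen, decay=0.999):
    """EMA of a 0/1 'was this expert routed to' signal, per expert."""
    for e in range(len(ema)):
        hit = 1.0 if e in chosen else 0.0
        ema[e] = decay * ema[e] + (1.0 - decay) * hit
    return ema

n_experts = 128
ema = [0.0] * n_experts
random.seed(0)
for _ in range(10_000):
    # Simulate a biased router that sends ~40% of tokens to expert 7
    chosen = {7} if random.random() < 0.4 else {random.randrange(n_experts)}
    ema = update_ema(ema, chosen)

dominant = max(range(n_experts), key=lambda e: ema[e])
```

The EMA converges toward each expert's long-run activation rate while remaining responsive to recent traffic, which makes the dominant and near-idle experts easy to rank.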

However, the research team at Sionic AI found that merely pruning experts based on activation frequency can degrade output quality. For instance, retaining only the top 64 most-activated experts in the Qwen3-235B-A22B model resulted in significant quality degradation, including repetitive outputs. This indicates that high activation frequency does not necessarily correlate with expert importance: a complementary relationship exists among experts, and even rarely chosen experts contribute critically to overall quality. Outside the top and bottom 10%, expert contribution tends to decrease almost linearly across the intermediate range, suggesting that the experts operate interdependently rather than in isolation.

Sionic AI proposes an "MoE Upscaling" approach that combines selective pruning strategies going beyond simple activation frequency with Post-Training: after pruning, the model undergoes additional training to adapt to the modified expert structure and ensure stable outputs. A key strategy is increasing the number of active experts per token (k). For example, the k value in the Qwen3-30B-A3B model was increased from 8 to 16. This is feasible because pruning reduces the total expert parameters held in memory, allowing more experts to be activated simultaneously without a prohibitive increase in serving cost. The increased k value, reflected in the `"num_experts_per_tok": 16` setting, improves the model's performance and stability on complex inputs by letting it draw on a wider range of the retained experts.
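The trade-off behind the k increase can be sketched with back-of-the-envelope arithmetic. The expert counts and the per-expert parameter size below are illustrative placeholders, not the model's published figures:

```python
def expert_layer_stats(n_experts, k, params_per_expert):
    """Return (total expert params held in memory, params active per token)."""
    return n_experts * params_per_expert, k * params_per_expert

p = 25_000_000                                               # hypothetical expert size
total_before, active_before = expert_layer_stats(128, 8, p)  # original model, k=8
total_after, active_after = expert_layer_stats(64, 16, p)    # pruned model, k=16
# Pruning halves the parameters resident in memory, while doubling k doubles
# the per-token expert compute, spending part of the pruning savings on quality.
```

In other words, the memory freed by removing experts is what creates headroom to route each token through more of the experts that remain.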

Finally, the paper mentions the application of Group Relative Policy Optimization (GRPO) to evaluate and optimize experts in groups rather than individually. This involves classifying expert groups (e.g., "keep candidates," "pruning candidates") and training routing policies based on their relative performance differences. Sionic AI also researches continuous learning methodologies for domain-specific model construction.