Qwen/Qwen3-Next-80B-A3B-Instruct · Hugging Face
Key Points
- Qwen3-Next-80B-A3B introduces a novel architecture featuring Hybrid Attention, a High-Sparsity Mixture-of-Experts, stability optimizations, and Multi-Token Prediction for enhanced efficiency.
- This 80-billion-parameter model, with 3 billion activated parameters, achieves performance comparable to the much larger Qwen3-235B on benchmarks and handles ultra-long contexts of up to 256K tokens natively.
- Qwen3-Next-80B-A3B combines strong parameter efficiency with fast inference, yielding roughly 10 times higher throughput for contexts over 32K tokens and extending effectively to 1 million tokens using YaRN.
The Qwen3-Next-80B-A3B-Instruct model represents a significant advancement in large language models, pursuing greater scaling efficiency as total parameter count and context length grow, through innovative architectural design. It is the first release in the Qwen3-Next series and features several key technical improvements aimed at improving performance while reducing computational cost.
The core methodology of Qwen3-Next-80B-A3B is built upon four primary enhancements:
- Hybrid Attention: This model replaces conventional standard attention mechanisms with a novel combination of Gated DeltaNet and Gated Attention. This hybrid approach is designed for efficient context modeling, particularly for ultra-long context lengths.
- Gated DeltaNet: This component utilizes linear attention heads, with 32 heads for the value projection and 16 heads for the query-key projection. Each head has a dimension of 128. The "DeltaNet" name suggests a departure from quadratic self-attention complexity toward a more efficient linear or sub-quadratic attention mechanism, while the "Gated" aspect implies a mechanism to control or filter the information flow.
- Gated Attention: This component consists of 16 attention heads for queries and 2 for key-value pairs, with a head dimension of 256. It incorporates Rotary Position Embedding (RoPE) with a dimension of 64, which is crucial for handling relative positional information within sequences. The "Gated" term here likewise suggests conditional or selective processing of attention outputs.
- The model's Hybrid Layout specifies the arrangement of these components within its 48 layers as a repeating block structure: 12 × [3 × (Gated DeltaNet → MoE) + 1 × (Gated Attention → MoE)]. Each of the 12 major blocks thus contains three Gated DeltaNet layers, each followed by a Mixture-of-Experts (MoE) layer, and then one Gated Attention layer, also followed by an MoE layer. This structure allows the model to leverage the two attention mechanisms strategically.
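The repeating structure described above can be sketched by enumerating the token-mixer type of each of the 48 layers. This is an illustrative helper, not Qwen's code; every layer also carries its MoE feed-forward block alongside the mixer listed here.

```python
# Sketch of the hybrid layout: 12 blocks, each with three Gated DeltaNet
# layers followed by one Gated Attention layer (every layer is paired with
# an MoE feed-forward block, omitted here for brevity).

def hybrid_layout(num_blocks: int = 12) -> list:
    layers = []
    for _ in range(num_blocks):
        layers.extend(["gated_deltanet"] * 3)  # linear-attention layers
        layers.append("gated_attention")       # standard-attention layer
    return layers

layers = hybrid_layout()
assert len(layers) == 48
# Gated Attention appears in 1 of every 4 layers (12 of 48).
assert layers.count("gated_attention") == 12
```

One consequence of this ratio is that only a quarter of the layers pay the quadratic attention cost, which is where the long-context efficiency comes from.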
- High-Sparsity Mixture-of-Experts (MoE): Qwen3-Next-80B-A3B employs MoE layers with an "extreme low activation ratio," significantly reducing the floating-point operations per token (FLOPs) while maintaining the model's overall capacity.
- The model has 512 experts in its MoE layers.
- Critically, only 10 routed experts are activated per token, yielding high sparsity (an activation ratio of 10/512, roughly 2%, for non-shared experts).
- It also includes 1 shared expert, so at each MoE layer a total of 11 experts (10 routing-chosen + 1 shared) process each token.
- Each expert has an intermediate dimension of 512. This high-sparsity design allows for a large number of parameters (80B total) with only a fraction (3B) being activated for any given token, leading to improved inference efficiency.
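The routing step behind this sparsity can be sketched as follows. The expert count and top-k value come from the model card; the scoring and softmax renormalization here are illustrative, not Qwen's exact routing code, and the shared expert would run on every token outside this routing step.

```python
import math
import random

NUM_EXPERTS, TOP_K = 512, 10  # figures from the model card

def route(logits, top_k=TOP_K):
    """Return indices and softmax-renormalized weights of the top-k experts."""
    top = sorted(range(len(logits)), key=logits.__getitem__)[-top_k:]
    m = max(logits[i] for i in top)                 # stabilize the softmax
    w = [math.exp(logits[i] - m) for i in top]
    s = sum(w)
    return top, [x / s for x in w]

random.seed(0)
idx, weights = route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
# Only 10 of 512 routed expert MLPs run for this token: ~2% activation.
```

Because the unselected experts contribute no computation, per-token FLOPs scale with the 10 active experts rather than all 512, which is what keeps the 80B model's effective cost near that of a 3B dense model.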
- Stability Optimizations: To ensure robust pre-training and post-training, the model incorporates advanced stability techniques. These include "zero-centered and weight-decayed layernorm" and other enhancements, which are critical for stabilizing the training of very deep and sparse models like MoEs.
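One way to read "zero-centered and weight-decayed layernorm" is sketched below: the learnable gain is stored as a zero-centered offset and applied as (1 + w), so ordinary weight decay on w pulls the effective gain toward 1 rather than toward 0. This is a hedged interpretation on an RMSNorm base; Qwen's exact formulation is not detailed in this summary and may differ.

```python
import math

def zero_centered_rmsnorm(x, w, eps=1e-6):
    """RMS-normalize x, scaling by (1 + w) so that w = 0 means unit gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [(1.0 + wi) * (v / rms) for v, wi in zip(x, w)]

# With w = 0 (the weight-decayed resting point), this is plain RMSNorm.
out = zero_centered_rmsnorm([3.0, -4.0], [0.0, 0.0])
```

The design point is that decaying a conventional gain toward 0 would shrink activations, whereas decaying a zero-centered offset leaves the norm's output scale intact.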
- Multi-Token Prediction (MTP): This technique boosts pre-training performance and significantly accelerates inference. While not fully detailed in the model card, MTP typically involves predicting multiple future tokens simultaneously, using the model's internal state to generate a draft of tokens that is then validated, thereby improving throughput. For optimal MTP performance, specialized inference frameworks like SGLang and vLLM are recommended, as they implement techniques such as speculative decoding.
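The draft-and-verify loop behind MTP-style speculative decoding can be sketched as a toy. Both models here are stand-ins: real systems (e.g. SGLang, vLLM) verify all k draft tokens in a single batched forward pass and accept or reject based on logits, not greedy equality.

```python
def speculative_step(draft, verify, prefix, k=4):
    """Draft k tokens cheaply, then keep the longest prefix the full model agrees with."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):                      # draft k tokens autoregressively
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    accepted = []
    for t in proposed:                      # verify with the full model
        expected = verify(list(prefix) + accepted)
        if expected == t:
            accepted.append(t)
        else:
            accepted.append(expected)       # commit the corrected token
            break
    return accepted                         # always >= 1 token per pass

# Toy demo: the full model deterministically continues "abcdef"; the draft
# agrees for the first three tokens, then guesses wrong.
target = list("abcdef")
full_model = lambda seq: target[len(seq)]
draft_model = lambda seq: target[len(seq)] if len(seq) < 3 else "x"
accepted = speculative_step(draft_model, full_model, [], k=4)
```

In the demo a single full-model pass commits four tokens (a, b, c, then the corrected d), which is where the inference speedup comes from when the draft is usually right.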
The model has 80 billion total parameters, of which only 3 billion are activated per token; non-embedding parameters account for 79 billion. It features a hidden dimension of 2048 and 48 layers. Native context length support is 262,144 tokens, extensible up to 1,010,000 tokens using the YaRN (Yet another RoPE extensioN) method.
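A quick arithmetic check on these figures (illustrative only): the activated-parameter fraction per token and the implied YaRN context-extension factor.

```python
# Arithmetic on the model-card figures above.
TOTAL_PARAMS, ACTIVE_PARAMS = 80e9, 3e9
NATIVE_CTX, YARN_CTX = 262_144, 1_010_000

print(f"activated fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.2%}")  # 3.75%
print(f"context extension:  {YARN_CTX / NATIVE_CTX:.2f}x")        # 3.85x
```

The ~3.85× extension is why a YaRN scaling factor of 4.0 is the natural choice for 1M-token support in the deployment notes below.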
Performance benchmarks indicate that Qwen3-Next-80B-A3B-Base achieves superior performance compared to Qwen3-32B-Base with 10% of the training cost and 10 times the inference throughput for contexts exceeding 32K tokens. The Instruct version, Qwen3-Next-80B-A3B-Instruct, performs comparably to the much larger Qwen3-235B-A22B-Instruct-2507 on various benchmarks, demonstrating significant advantages in ultra-long-context tasks up to 256K tokens, and with YaRN, even up to 1 million tokens. For example, on the 1M RULER benchmark, Qwen3-Next-80B-A3B-Instruct achieves an average accuracy of 91.8% across various context lengths up to 1 million tokens.
For deployment, the model card recommends frameworks like SGLang (v0.5.2 or later) or vLLM (v0.10.2 or later) to serve OpenAI-compatible API endpoints, with support for tensor parallelism and configuration of context length and MTP speculative decoding. For ultra-long text processing beyond the native 262,144 tokens, YaRN scaling can be enabled by adding rope_scaling parameters to config.json or by passing command-line arguments to SGLang or vLLM, typically with a factor of 4.0 for 1M-token support; it is advised to adjust the factor to the application's typical context lengths. Recommended sampling parameters are Temperature=0.7, TopP=0.8, TopK=20, and MinP=0, with an optional presence_penalty to mitigate repetition.
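The config.json edit can be sketched as a small script. The field names follow the Hugging Face convention for YaRN rope scaling and are an assumption here; verify them against the actual model card before use.

```python
import json

NATIVE_CTX = 262_144  # the model's native context length

def add_yarn(config, factor=4.0):
    """Add a YaRN rope_scaling entry (field names assumed, not verified)."""
    config["rope_scaling"] = {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": NATIVE_CTX,
    }
    return config

cfg = add_yarn({"max_position_embeddings": NATIVE_CTX})
print(json.dumps(cfg["rope_scaling"], indent=2))
# 262,144 x 4.0 covers the advertised 1,010,000-token range.
```

Since YaRN scaling can degrade short-context quality slightly, lowering the factor when 1M-token inputs are not needed is the safer default.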