Qwen/Qwen3-235B-A22B · Hugging Face
Key Points
- Qwen3 represents the latest generation of Qwen large language models, featuring both dense and Mixture-of-Experts architectures engineered for groundbreaking advancements in reasoning, instruction-following, and agent capabilities.
- A core innovation is its ability to seamlessly switch between a "thinking" mode for complex logical tasks and a "non-thinking" mode for efficient general dialogue, enhancing performance across diverse scenarios.
- The Qwen3-235B-A22B model, with 235 billion total parameters, demonstrates superior human preference alignment, excels in agentic tool-calling, supports over 100 languages, and offers an extended context window up to 131,072 tokens with YaRN.
Qwen3 is the latest generation of large language models from the Qwen series, encompassing both dense and Mixture-of-Experts (MoE) architectures. The Qwen3-235B-A22B model, a specific MoE variant, features a total of 235 billion parameters with 22 billion activated parameters per token, 234 billion non-embedding parameters, 94 layers, 64 attention heads for Queries (Q) and 4 for Keys/Values (KV) in a Grouped Query Attention (GQA) setup, 128 experts, and 8 activated experts. It natively supports a context length of 32,768 tokens, which can be extended to 131,072 tokens using the YaRN (Yet another RoPE extension) method.
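As a quick sanity check, the headline ratios implied by these architecture figures can be derived directly (a minimal sketch; the totals are quoted from the model card, not computed from weights):

```python
# Architecture figures quoted above (from the model card).
total_params = 235e9        # total parameters
active_params = 22e9        # parameters activated per token
q_heads, kv_heads = 64, 4   # Grouped Query Attention head counts
native_ctx, extended_ctx = 32_768, 131_072

# Fraction of parameters active per token in the MoE.
print(f"active fraction: {active_params / total_params:.1%}")   # ~9.4%

# GQA shares each KV head across q_heads/kv_heads query heads,
# shrinking the KV cache by the same factor versus full multi-head attention.
print(f"KV cache reduction: {q_heads // kv_heads}x")            # 16x

# YaRN scaling factor needed to reach the extended context.
print(f"YaRN factor: {extended_ctx / native_ctx:.1f}")          # 4.0
```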
The core methodology of Qwen3 revolves around several key advancements:
- Thinking and Non-Thinking Modes: Qwen3 introduces a unique capability for seamless switching between a "thinking mode" for complex logical reasoning, mathematics, and coding, and a "non-thinking mode" for efficient, general-purpose dialogue.
- Hard Switch: Controlled by the `enable_thinking` parameter in `tokenizer.apply_chat_template`. Setting `enable_thinking=True` (the default) activates thinking mode, causing the model to generate internal thoughts wrapped in a `<think>...</think>` block, followed by the final response. Setting `enable_thinking=False` strictly disables thinking, omitting the thought block.
- Soft Switch: When `enable_thinking=True`, users can dynamically control the mode within prompts using `/think` and `/no_think` tags. The model follows the most recent instruction in multi-turn conversations. Even with soft switches, when `enable_thinking=True`, a `<think>...</think>` block is always output, though its content may be empty if thinking is disabled by `/no_think`.
- Sampling Parameters: Optimal performance dictates different sampling parameters for each mode. For thinking mode, the recommended settings are Temperature = 0.6, Top P = 0.95, Top K = 20, and Min P = 0. Greedy decoding is explicitly discouraged for thinking mode, as it can cause performance degradation and repetitions. For non-thinking mode, the suggested settings are Temperature = 0.7, Top P = 0.8, Top K = 20, and Min P = 0. A `presence_penalty` between 0 and 2 can also be used to reduce repetitions.
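The switching behavior above can be illustrated without running the model. The sketch below uses hypothetical helper functions (not part of any library) to show the prompt-side soft-switch pattern and how a response containing a `<think>` block splits into reasoning and final answer:

```python
def apply_soft_switch(user_msg: str, thinking: bool) -> str:
    # Soft switch: append /think or /no_think to the user turn;
    # the model honors the most recent tag in multi-turn chats.
    return f"{user_msg} {'/think' if thinking else '/no_think'}"

def split_thinking(response: str) -> tuple[str, str]:
    # With enable_thinking=True the model always emits a <think> block
    # (possibly empty under /no_think), followed by the final answer.
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in response:
        return "", response.strip()
    thought, _, final = response.partition(close_tag)
    return thought.replace(open_tag, "", 1).strip(), final.strip()

prompt = apply_soft_switch("How many r's are in 'strawberry'?", thinking=True)
raw = "<think>s-t-r-a-w-b-e-r-r-y has r at positions 3, 8, 9.</think>\nThere are 3 r's."
thought, answer = split_thinking(raw)
print(answer)   # There are 3 r's.
```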
- Enhanced Reasoning and Alignment: Qwen3 demonstrates significant improvements in reasoning capabilities, outperforming prior QwQ and Qwen2.5-Instruct models in areas such as mathematics, code generation, and commonsense logical reasoning. It also achieves superior human preference alignment, excelling in creative writing, role-playing, and multi-turn dialogues, fostering a more natural conversational experience.
- Agent Capabilities: The model possesses strong tool-calling capabilities, allowing for precise integration with external tools in both thinking and non-thinking modes. It achieves leading performance among open-source models in complex agent-based tasks, particularly when used with frameworks like Qwen-Agent, which encapsulates tool-calling templates and parsers.
- Multilingual Support: Qwen3 supports over 100 languages and dialects, showcasing robust capabilities for multilingual instruction following and translation.
- Long Context Processing with YaRN: For handling texts exceeding the native 32,768 token context, Qwen3 utilizes the YaRN method for RoPE scaling. This extends the effective context window up to 131,072 tokens.
- Implementation: YaRN can be enabled by modifying the `config.json` file to include `rope_scaling` fields (e.g., `{"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}`). Alternatively, command-line arguments are available for inference frameworks like vLLM and SGLang.
- Static vs. Dynamic YaRN: Open-source frameworks primarily implement static YaRN, where the scaling factor remains constant regardless of input length, potentially affecting performance on shorter texts. Alibaba Model Studio's endpoint supports dynamic YaRN, adapting the scaling factor as needed. It is advised to apply `rope_scaling` only when truly necessary for long contexts and to adjust the factor appropriately (e.g., a factor of 2.0 for 65,536-token contexts).
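A minimal sketch of the `config.json` edit for static YaRN follows. The field values come from the snippet above; the helper function and file path are hypothetical, and the factor is computed as target context divided by native context:

```python
import json
from pathlib import Path

def enable_yarn(config_path: str, target_ctx: int, native_ctx: int = 32_768) -> dict:
    # Static YaRN: scaling factor = target context / native context.
    # Prefer the smallest factor that covers your longest inputs, since
    # static scaling can degrade quality on shorter texts.
    cfg = json.loads(Path(config_path).read_text())
    cfg["rope_scaling"] = {
        "rope_type": "yarn",
        "factor": target_ctx / native_ctx,
        "original_max_position_embeddings": native_ctx,
    }
    Path(config_path).write_text(json.dumps(cfg, indent=2))
    return cfg["rope_scaling"]

# e.g. enable_yarn("Qwen3-235B-A22B/config.json", target_ctx=131_072)
# writes {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}
```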
Best practices for optimal performance include using an output length of 32,768 tokens for most queries, or 38,912 tokens for highly complex problems such as competitive math and programming. Standardizing output formats for benchmarking is also advised, such as including "Please reason step by step, and put your final answer within \\boxed{}" for math problems, or requesting a JSON structure for multiple-choice questions (e.g., "answer": "C"). Crucially, in multi-turn conversations, only the final model output should be retained in the history, excluding any internal thinking content.
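The multi-turn rule above (retain only the final output, drop thinking content) can be sketched as follows. This is a hypothetical helper, assuming the `<think>...</think>` wrapping described earlier:

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def clean_history(messages: list[dict]) -> list[dict]:
    # Keep user turns as-is; for assistant turns, retain only the final
    # answer and drop any <think>...</think> reasoning before re-sending.
    cleaned = []
    for m in messages:
        if m["role"] == "assistant":
            m = {**m, "content": THINK_BLOCK.sub("", m["content"]).strip()}
        cleaned.append(m)
    return cleaned

history = [
    {"role": "user", "content": "What is 17 * 23?"},
    {"role": "assistant", "content": "<think>17*23 = 17*20 + 17*3 = 340 + 51.</think>\n391"},
]
print(clean_history(history)[1]["content"])   # 391
```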