Qwen/Qwen3-235B-A22B · Hugging Face

2025.05.18
Hugging Face · by Anonymous
#LLM #Transformers #Qwen #Text Generation #Conversational AI

Key Points

  • Qwen3 represents the latest generation of Qwen large language models, featuring both dense and Mixture-of-Experts architectures engineered for groundbreaking advancements in reasoning, instruction-following, and agent capabilities.
  • A core innovation is its ability to switch seamlessly between a "thinking" mode for complex logical tasks and a "non-thinking" mode for efficient general dialogue, enhancing performance across diverse scenarios.
  • The Qwen3-235B-A22B model, with 235 billion total parameters (22 billion activated per token), demonstrates superior human preference alignment, excels in agentic tool-calling, supports over 100 languages, and offers a context window extendable to 131,072 tokens with YaRN.

Qwen3 is the latest generation of large language models from the Qwen series, encompassing both dense and Mixture-of-Experts (MoE) architectures. The Qwen3-235B-A22B model, a specific MoE variant, features a total of 235 billion parameters with 22 billion activated parameters per token, 234 billion non-embedding parameters, 94 layers, 64 attention heads for Queries (Q) and 4 for Keys/Values (KV) in a Grouped Query Attention (GQA) setup, 128 experts, and 8 activated experts. It natively supports a context length of 32,768 tokens, which can be extended to 131,072 tokens using the YaRN (Yet another RoPE extensioN) method.
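The relationship between the native window, the YaRN scaling factor, and the extended window is simple linear arithmetic; a quick illustrative sketch (the variable names are mine, not from the model card):

```python
# YaRN extends the RoPE context window linearly:
#   extended_context = original_max_position_embeddings * factor
native_context = 32_768   # Qwen3's native context length in tokens
factor = 4.0              # the rope_scaling factor used for the full extension
extended_context = int(native_context * factor)
print(extended_context)   # 131072
```

This is why a smaller factor (e.g., 2.0) is suggested when only a 65,536-token window is needed.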

The core methodology of Qwen3 revolves around several key advancements:

  1. Thinking and Non-Thinking Modes: Qwen3 introduces a unique capability for seamless switching between a "thinking mode" for complex logical reasoning, mathematics, and coding, and a "non-thinking mode" for efficient, general-purpose dialogue.
    • Hard Switch: Controlled by the enable_thinking parameter in tokenizer.apply_chat_template. Setting enable_thinking=True (the default) activates thinking mode, causing the model to generate internal thoughts wrapped in a <think>...</think> block, followed by the final response. Setting enable_thinking=False strictly disables thinking, omitting the thought block.
    • Soft Switch: When enable_thinking=True, users can dynamically control the mode within prompts using /think and /no_think tags; the model follows the most recent instruction in multi-turn conversations. Even with soft switches, when enable_thinking=True a <think>...</think> block is always output, though its content may be empty if thinking is disabled by /no_think.
    • Sampling Parameters: Optimal performance dictates different sampling parameters for each mode. For thinking mode, the recommended settings are Temperature = 0.6, TopP = 0.95, TopK = 20, and MinP = 0; greedy decoding is explicitly discouraged for thinking mode, as it can cause performance degradation and repetition. For non-thinking mode, the suggested settings are Temperature = 0.7, TopP = 0.8, TopK = 20, and MinP = 0. A presence_penalty between 0 and 2 can also be used to reduce repetitions.
  2. Enhanced Reasoning and Alignment: Qwen3 demonstrates significant improvements in reasoning capabilities, outperforming prior QwQ and Qwen2.5-Instruct models in areas such as mathematics, code generation, and commonsense logical reasoning. It also achieves superior human preference alignment, excelling in creative writing, role-playing, and multi-turn dialogues, fostering a more natural conversational experience.
  3. Agent Capabilities: The model possesses strong tool-calling capabilities, allowing for precise integration with external tools in both thinking and non-thinking modes. It achieves leading performance among open-source models in complex agent-based tasks, particularly when used with frameworks like Qwen-Agent, which encapsulates tool-calling templates and parsers.
  4. Multilingual Support: Qwen3 supports over 100 languages and dialects, showcasing robust capabilities for multilingual instruction following and translation.
  5. Long Context Processing with YaRN: For handling texts exceeding the native 32,768 token context, Qwen3 utilizes the YaRN method for RoPE scaling. This extends the effective context window up to 131,072 tokens.
    • Implementation: YaRN can be enabled by modifying the config.json file to include rope_scaling fields (e.g., {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}). Alternatively, command-line arguments are available for inference frameworks like vLLM and SGLang.
    • Static vs. Dynamic YaRN: Open-source frameworks primarily implement static YaRN, where the scaling factor remains constant, potentially affecting performance on shorter texts. Alibaba Model Studio's endpoint supports dynamic YaRN, adapting the scaling factor as needed. It is advised to apply rope_scaling only when truly necessary for long contexts and to adjust the factor appropriately (e.g., factor=2.0 for 65,536-token contexts).
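Since a <think>...</think> block always precedes the final answer when thinking mode is enabled (and may be empty under /no_think), downstream code typically splits the two before display. A minimal sketch in Python (the helper name and regex are illustrative, not part of the Qwen3 API):

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Separate the <think>...</think> block from the final response.

    Returns (thinking_content, final_answer). If no think block is
    present (non-thinking mode), thinking_content is empty.
    """
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", output.strip()

# Thinking mode: reasoning is wrapped in <think> tags.
thinking, answer = split_thinking("<think>2+2 is 4.</think>The answer is 4.")
# Soft switch /no_think: the block is present but its content is empty.
empty, answer2 = split_thinking("<think></think>Hello!")
```

Frameworks such as Qwen-Agent perform this kind of parsing internally; the sketch just shows what the raw output looks like in each mode.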

Best practices for optimal performance include setting an output length of 32,768 tokens for most queries, or 38,912 tokens for highly complex problems such as competitive math and programming. Standardizing output formats for benchmarking is also advised, such as including "Please reason step by step, and put your final answer within \boxed{}" for math problems or providing a JSON structure for multiple-choice questions (e.g., "answer": "C"). Crucially, in multi-turn conversations, only the final model output should be retained in the history, excluding any internal thinking content.
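The history rule above can be sketched as a small helper that strips any thinking content before storing an assistant turn; this is an illustrative implementation, not code from the model card:

```python
import re

def append_to_history(history: list[dict], raw_output: str) -> list[dict]:
    """Retain only the final answer in the conversation history.

    Any <think>...</think> content is dropped before the assistant
    turn is stored, as recommended for multi-turn conversations.
    """
    final = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    history.append({"role": "assistant", "content": final})
    return history

history = [{"role": "user", "content": "What is 2+2?"}]
append_to_history(history, "<think>Compute 2+2.</think>It is 4.")
# history now holds only the final answer, with no thinking content
```

Keeping the stripped turns in history avoids re-feeding long reasoning traces back into the context window on subsequent turns.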