GitHub - QwenLM/Qwen3.5: Qwen3.5 is the large language model series developed by Qwen team, Alibaba Cloud.
QwenLM
2026.02.23
GitHub · by 이호민
#AI #LLM #Multimodal #Open Source #Qwen

Key Points

  1. Qwen3.5 is a new series of large language models developed by Alibaba Cloud, significantly advancing foundation models with unified vision-language capabilities and enhanced performance.
  2. It integrates key innovations such as an efficient hybrid architecture with sparse Mixture-of-Experts, scalable reinforcement learning, and expanded support for 201 languages and dialects.
  3. The models, including an initial 397B-A17B MoE version, are available on Hugging Face and ModelScope, supporting various inference frameworks and finetuning methods for broad deployment.

Qwen3.5 represents a significant advancement in large language models, developed by the Qwen team at Alibaba Cloud, integrating breakthroughs across multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility.

The core methodology of Qwen3.5 is characterized by five key enhancements:

  1. Unified Vision-Language Foundation: This foundational capability is achieved through early fusion training on trillions of multimodal tokens. Unlike traditional approaches that might process modalities separately before fusion, Qwen3.5 integrates visual and linguistic information from the initial stages of training. This deep, native multimodal integration allows Qwen3.5 to achieve cross-generational parity with its text-only predecessor, Qwen3, and surpass the performance of earlier Qwen3-VL models across diverse benchmarks including reasoning, coding, agentic capabilities, and visual understanding. This suggests a shared representational space for both modalities from the outset.
  2. Efficient Hybrid Architecture: Qwen3.5 incorporates a novel architectural design to ensure high-throughput inference while minimizing latency and cost. This is primarily facilitated by the combination of Gated Delta Networks and sparse Mixture-of-Experts (MoE).
    • Gated Delta Networks replace standard quadratic attention in part of the stack with a gated linear-attention recurrence: a decay gate controls how quickly a recurrent state forgets old content, while a delta-rule update writes new key-value associations into it, keeping per-token compute and memory roughly constant with sequence length.
    • Sparse Mixture-of-Experts (MoE) is the second pillar, exemplified by the initial release of a 397-billion-parameter model (397B) with only 17 billion active parameters (A17B) per token. In an MoE architecture, the model contains multiple "expert" sub-networks, and a learned "router" (gating network) selects a small subset of them (e.g., 2-4 out of many) to process each input token. The total parameter count, and hence model capacity and knowledge, can therefore be very large, while only the small, constant set of activated parameters incurs compute at inference time — here roughly 17B of 397B, about 4% of the weights per token. This significantly improves throughput and reduces latency compared to dense models of similar total parameter count. The mention of a "hybrid attention architecture" in related models further suggests that attention mechanisms are interleaved with this sparse expert framework.
  3. Scalable RL Generalization: The model's robustness and adaptability in real-world scenarios are significantly enhanced through reinforcement learning (RL) scaled across million-agent environments. The model is trained as an agent over progressively more complex task distributions, likely using policy-gradient algorithms such as PPO together with large-scale simulation, so that it learns robust policies and generalizes to unseen situations and diverse task requirements by optimizing long-horizon rewards in complex interactive environments.
  4. Global Linguistic Coverage: Qwen3.5 demonstrates expanded support for 201 languages and dialects. This broad linguistic coverage is achieved through comprehensive multilingual training datasets and robust tokenizer design, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.
  5. Next-Generation Training Infrastructure: Supporting these advancements is a sophisticated training infrastructure. This infrastructure enables near-100% multimodal training efficiency compared to text-only training, meaning the overhead for incorporating visual data into the training pipeline is nearly negligible. Furthermore, it incorporates asynchronous RL frameworks that support massive-scale agent scaffolds and environment orchestration, crucial for parallelizing and scaling the reinforcement learning process across numerous agents and complex simulations.
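The gated delta-rule recurrence behind Gated Delta Networks can be sketched as a single state-update step. The toy function below is a minimal illustration under our own naming (`gated_delta_step`, `alpha` for the decay gate, `beta` for the write strength); real Gated DeltaNet implementations fuse this recurrence into chunked GPU kernels rather than looping per token:

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of the state matrix S (d_k x d_v):

        S' = alpha * (S - beta * k (k^T S)) + beta * k v^T

    alpha in (0, 1] decays (forgets) old associations; the delta term first
    erases what S currently returns for key k, then writes the new value v.
    Toy sketch with plain lists -- not a production kernel.
    """
    d_k, d_v = len(S), len(S[0])
    # k^T S: what the state currently predicts for key k (length d_v)
    kS = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return [
        [alpha * (S[i][j] - beta * k[i] * kS[j]) + beta * k[i] * v[j]
         for j in range(d_v)]
        for i in range(d_k)
    ]
```

With `alpha=1.0` and `beta=1.0`, querying the updated state with a unit-norm key `k` returns exactly the value `v` just written, which is the associative-memory behavior the recurrence is designed for.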
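The top-k routing described for the MoE layers can be made concrete with a minimal pure-Python sketch. The names `top_k_route` and `moe_forward` are ours, not Qwen's implementation, and production routers add load-balancing losses and batched expert dispatch on top of this:

```python
import math

def top_k_route(logits, k=2):
    """Select the k highest-scoring experts and softmax-renormalize
    their gate weights so they sum to 1."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in topk)          # subtract max for stability
    weights = [math.exp(logits[i] - m) for i in topk]
    z = sum(weights)
    return {i: w / z for i, w in zip(topk, weights)}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the routed experts and mix their outputs by gate weight;
    the unselected experts cost nothing for this token."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())
```

Selecting, say, 2 of 8 experts means only a quarter of the expert parameters touch a given token; this is the mechanism by which Qwen3.5-397B-A17B activates roughly 17B of its 397B parameters (about 4%) per token.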
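The RL generalization in point 3 rests on standard policy-gradient machinery. As a deliberately small stand-in for the (undisclosed) large-scale setup — a two-armed bandit rather than a million-agent environment, with all names ours — REINFORCE shifts probability mass toward rewarded actions:

```python
import math
import random

def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Minimal REINFORCE on a multi-armed bandit: raise the log-probability
    of actions in proportion to the reward they earn."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        # softmax policy over arms
        m = max(logits)
        probs = [math.exp(l - m) for l in logits]
        z = sum(probs)
        probs = [p / z for p in probs]
        # sample an action from the policy
        r, acc, a = rng.random(), 0.0, 0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                a = i
                break
        reward = rewards[a]
        # policy-gradient update: grad log pi(a) = 1[i == a] - pi(i)
        for i in range(len(logits)):
            logits[i] += lr * reward * ((1.0 if i == a else 0.0) - probs[i])
    return logits
```

The same gradient signal, scaled up with asynchronous actors, learned reward models, and environment orchestration, is what frameworks like the one described in point 5 parallelize across many agents.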

The initial public release is the Qwen3.5-397B-A17B MoE model, with subsequent models of varying sizes planned. The model weights are accessible via Hugging Face Hub and ModelScope.