vLLM v0.14.0

2026.01.21 · LinkedIn · by 이호민

Key Points

  • vLLM v0.14.0 introduces significant changes, including default async scheduling, a PyTorch 2.9.1 requirement, and the removal of deprecated quantization schemes, while adding a gRPC server and `--max-model-len auto` for efficient GPU memory usage.
  • The release expands model compatibility to Grok-2 and several multimodal architectures, adds multimodal LoRA support for models like LLaVA, and enhances performance with CUTLASS MoE optimizations.
  • Hardware support is updated for SM103 and B300 Blackwell, with new large-scale serving features like Extended Dual-Batch Overlap (XBO) and NIXL asymmetric TP to improve efficiency.

The vLLM v0.14.0 release introduces significant enhancements across performance, model compatibility, and system architecture, incorporating 660 commits from 251 contributors. This update includes several breaking changes, necessitating careful review before upgrading.

Key Breaking Changes:

  1. Asynchronous Scheduling as Default: Asynchronous request scheduling is now enabled by default, and can be explicitly disabled using the `--no-async-scheduling` flag. This shift aims to improve concurrency and resource utilization by allowing non-blocking operations.
  2. PyTorch Version Requirement: The minimum required PyTorch version is now 2.9.1, with the default wheel compiled against cu129 (CUDA 12.9). This ensures compatibility with recent PyTorch features and CUDA capabilities.
  3. Quantization Scheme Removal: Deprecated quantization schemes have been removed to streamline the codebase and focus on actively supported methods.
  4. Speculative Decoding Error Handling: Speculative decoding, when encountering unsupported sampling parameters, will now explicitly fail rather than silently ignoring these parameters. This change improves robustness and prevents unexpected behavior.
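For teams upgrading, the most upgrade-relevant switches can be checked at launch. This is a minimal sketch; `<your-model>` is a placeholder, and the `--no-async-scheduling` flag is the one named in the release notes:

```shell
# Confirm the installed PyTorch meets the new 2.9.1 minimum
python -c "import torch; print(torch.__version__)"

# Opt out of the new default async scheduling if a workload
# depends on the previous synchronous behavior
vllm serve <your-model> --no-async-scheduling
```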

Core Methodological and Architectural Improvements:

  1. gRPC Server Entrypoint: A new gRPC server entrypoint is introduced, leveraging a binary protocol and HTTP/2 multiplexing. This provides a high-throughput serving mechanism, enabling more efficient communication between clients and the vLLM server, particularly beneficial for microservices architectures and high-load environments. The binary nature reduces overhead compared to text-based protocols, while HTTP/2 multiplexing allows multiple requests over a single connection, improving efficiency.
  2. Automatic Context Length Management (--max-model-len auto): This feature automatically adjusts the model's maximum context length to fit the available GPU memory, thereby mitigating Out-of-Memory (OOM) errors during startup. This implies an intelligent memory profiling or estimation mechanism that dynamically configures the maximum sequence length based on the detected GPU hardware and memory capacity.
  3. Model Inspection View: A new model inspection view is available by setting `VLLM_LOG_MODEL_INSPECTION=1` or printing the LLM object. This feature allows users to programmatically or via logs inspect internal model components, including modules, attention backends, and applied quantization methods, aiding in debugging and understanding model configurations.
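The memory-fitting idea behind `--max-model-len auto` can be sketched as a simple budget calculation: the KV cache cost of one token follows from the model's layer count, KV-head count, and head dimension, and the context length is capped so the cache fits in free GPU memory. The function names and formula below are illustrative, not vLLM's actual implementation:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def fit_max_model_len(free_gpu_bytes: int, num_layers: int, num_kv_heads: int,
                      head_dim: int, dtype_bytes: int = 2,
                      utilization: float = 0.9) -> int:
    """Largest context length whose KV cache fits in the memory budget."""
    budget = int(free_gpu_bytes * utilization)
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads,
                                         head_dim, dtype_bytes)
    return budget // per_token
```

For a Llama-style configuration (32 layers, 8 KV heads, head dimension 128, fp16), each token costs 128 KiB of KV cache, so the affordable context length scales linearly with free memory.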

Extended Model Support:

The release significantly expands support for various large language model (LLM) architectures and modalities, including:

  • Grok-2, with integration of its tiktoken tokenizer.
  • LFM2-VL, a vision-language model.
  • MiMo-V2-Flash.
  • GLM-ASR, an audio-based model.
  • K-EXAONE-236B-A23B, a Mixture-of-Experts (MoE) architecture.
Additionally, LoRA (Low-Rank Adaptation) now supports multimodal tower/connector structures, enhancing adaptability for models such as LLaVA, BLIP2, PaliGemma, and Pixtral. This allows for efficient fine-tuning of specific components within complex multimodal architectures.
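The mechanism LoRA applies to these tower/connector layers is the standard low-rank update W' = W + (α/r)·BA, trained while the base weight stays frozen. A minimal NumPy sketch of that update (illustrative only; names and shapes are not vLLM's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 32, 4, 8

W = rng.standard_normal((d_out, d_in))        # frozen base weight (e.g., a vision-connector projection)
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable low-rank down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-initialized

def lora_forward(x, W, A, B, alpha=8, rank=4):
    # Base path plus the scaled low-rank update; only A and B are trained.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d_in))
# With B zero-initialized, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Because the rank r is small, only (d_in + d_out)·r parameters per adapted layer are trained, which is what makes fine-tuning individual towers or connectors of a multimodal model cheap.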

Performance Optimizations:

Several low-level and large-scale serving optimizations have been implemented:

  • CUTLASS MoE Optimizations: Specific optimizations for Mixture-of-Experts (MoE) models using the CUTLASS library yield a 2.9% improvement in throughput and a 10.8% reduction in Time-To-First-Token (TTFT). This is attributed to a fill(0) optimization, likely a memory initialization or kernel fusion technique that reduces latency.
  • Hardware-Specific Enhancements:
    • Support for NVIDIA SM103 (a Blackwell-generation compute capability).
    • Specific B300 Blackwell MoE configurations to leverage the unique capabilities of these high-performance GPUs.
    • Integration of Marlin, an optimized mixed-precision GEMM kernel for quantized (e.g., 4-bit) weights, for Turing (sm75) architectures, extending fast quantized inference to older GPU generations.
  • Large-Scale Serving Techniques:
    • XBO (Extended Dual-Batch Overlap): This technique likely optimizes GPU utilization by overlapping computation and communication/data transfer across two batches, maximizing throughput in multi-GPU or distributed settings.
    • NIXL asymmetric TP (Tensor Parallelism): This refers to an advanced form of tensor parallelism, possibly involving non-uniform distribution or specialized communication patterns, to optimize large model serving across multiple devices or nodes, particularly where GPU resources are asymmetric.
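The dual-batch overlap idea can be illustrated as a two-stage pipeline: while batch i is being computed, batch i+1's transfer is already in flight, so the compute units never idle waiting on data. The sketch below is a conceptual CPU-side illustration using threads, not vLLM's implementation; `transfer` and `compute` are stand-ins for communication and the forward pass:

```python
from concurrent.futures import ThreadPoolExecutor

def transfer(batch):   # stand-in for host-to-device copy / inter-GPU communication
    return f"t{batch}"

def compute(staged):   # stand-in for the GPU forward pass
    return f"c{staged}"

def serve_overlapped(batches):
    if not batches:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(transfer, batches[0])
        for i, _ in enumerate(batches):
            staged = pending.result()
            if i + 1 < len(batches):
                # Kick off the next batch's transfer so it overlaps
                # with the current batch's compute.
                pending = pool.submit(transfer, batches[i + 1])
            results.append(compute(staged))
    return results
```

With ideal overlap, total time approaches max(transfer, compute) per batch instead of their sum, which is the throughput gain such techniques target in multi-GPU serving.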