zai-org/GLM-4.7-Flash · Hugging Face
Key Points
- GLM-4.7-Flash is introduced as a 30B-A3B Mixture-of-Experts (MoE) model, positioned as the strongest in its class for balancing performance and efficiency.
- It achieves leading results on various benchmarks, outperforming models like Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B on most tasks, with the largest margins on agentic benchmarks such as SWE-bench Verified and BrowseComp.
- The model card provides detailed instructions and code examples for lightweight local deployment using popular inference frameworks like vLLM and SGLang.
GLM-4.7-Flash is a 30-billion parameter Mixture-of-Experts (MoE) model, specifically a 30B-A3B architecture. Developed as a lightweight deployment option, it aims to balance performance and efficiency, asserting itself as a leading model in the 30B class. The model supports both English and Chinese languages.
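The "30B-A3B" label follows the usual MoE naming convention of total versus activated parameters: roughly 30B parameters in total, of which about 3B are activated per token. A back-of-envelope sketch of what that implies for memory and compute (illustrative arithmetic only, not official specifications for this model):

```python
# Rough sizing for a 30B-A3B MoE model. All experts must reside in memory,
# but only the activated subset participates in each token's forward pass.
total_params = 30e9   # "30B": total parameter count
active_params = 3e9   # "A3B": ~3B parameters activated per token
bytes_per_param_bf16 = 2  # bfloat16 uses 2 bytes per parameter

weight_memory_gb = total_params * bytes_per_param_bf16 / 1e9
active_fraction = active_params / total_params

print(f"Weights in bf16: ~{weight_memory_gb:.0f} GB")          # ~60 GB
print(f"Active per token: {active_fraction:.0%} of parameters")  # 10%
```

This is why an MoE model of this size can approach the per-token compute cost of a ~3B dense model while retaining the capacity of a 30B one.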
The model card presents comparative performance benchmarks against Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash demonstrates competitive, and in several cases superior, performance across a range of tasks:
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
|---|---|---|---|
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
The core methodology for local deployment and inference of GLM-4.7-Flash involves utilizing optimized inference frameworks such as vLLM and SGLang, alongside the standard Hugging Face Transformers library.
Local Deployment with Hugging Face Transformers:
The model can be loaded and used for text generation via the transformers library. This involves:
- Installation: Ensure the latest `transformers` from the main branch is installed: `pip install git+https://github.com/huggingface/transformers.git`
- Model and Tokenizer Loading:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "zai-org/GLM-4.7-Flash"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,  # bfloat16 for memory efficiency and speed
    device_map="auto",           # map model layers to available devices (e.g., GPUs)
)
```
- Chat Template Application and Generation:
```python
messages = [{"role": "user", "content": "hello"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,               # tokenize the input messages
    add_generation_prompt=True,  # append the generation prompt for the assistant turn
    return_dict=True,
    return_tensors="pt",         # return PyTorch tensors
)
inputs = inputs.to(model.device)  # move inputs to the model's device

generated_ids = model.generate(
    **inputs,
    max_new_tokens=128,  # cap the number of newly generated tokens
    do_sample=False,     # greedy decoding (no sampling)
)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
```

This method provides a foundational approach to direct inference with the `transformers` library, leveraging automatic device mapping and bfloat16 precision for efficiency.

Optimized Deployment with vLLM:
vLLM is an open-source library for high-throughput and low-latency inference. For GLM-4.7-Flash, vLLM can be set up as a serving endpoint:
- Installation: `pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly`
- Serving Command:
```shell
vllm serve zai-org/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-flash
```

Here `--tensor-parallel-size 4` distributes the model across 4 GPUs; the `--speculative-config.*` options enable Multi-Token Prediction (MTP) speculative decoding with one speculative token; `--tool-call-parser glm47` and `--reasoning-parser glm45` parse tool calls and reasoning steps in the GLM-4.7 and GLM-4.5 formats respectively; and `--enable-auto-tool-choice` lets the server automatically choose the appropriate tool. Together, these parameters enable efficient serving with tensor parallelism, speculative decoding, and specialized parsing for tool calls and reasoning, enhancing the model's agentic capabilities.
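Once running, `vllm serve` exposes an OpenAI-compatible HTTP API. A minimal client sketch using only the standard library, assuming the server above is running locally on vLLM's default port 8000 (the `model` field must match `--served-model-name`):

```python
# Minimal sketch of querying the vLLM server's OpenAI-compatible endpoint.
# Assumes the `vllm serve` command above is running on localhost:8000.
import json
import urllib.request

payload = {
    "model": "glm-4.7-flash",  # must match --served-model-name
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (e.g. the `openai` Python SDK pointed at `http://localhost:8000/v1`) can be used the same way.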
Optimized Deployment with SGLang:
SGLang is another framework for efficient LLM serving, particularly for structured generation.
- Installation: install the latest SGLang, e.g. `pip install "sglang[all]"`.
- Serving Command:
```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-Flash \
    --tp-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.8 \
    --served-model-name glm-4.7-flash \
    --host 0.0.0.0 \
    --port 8000
```

Here `--tp-size 4` applies tensor parallelism across 4 GPUs; `--tool-call-parser glm47` and `--reasoning-parser glm45` enable GLM-specific tool-call and reasoning parsing; the `--speculative-*` options configure EAGLE speculative decoding (3 decoding steps, top-1 draft sampling, 4 draft tokens); `--mem-fraction-static 0.8` statically allocates 80% of GPU memory; and `--host 0.0.0.0` binds the server to all available network interfaces.

For Blackwell GPUs, the additional arguments `--attention-backend triton --speculative-draft-attention-backend triton` are recommended for enhanced attention performance. SGLang's configuration emphasizes speculative decoding using the EAGLE algorithm and fine-grained control over memory allocation and serving parameters.

The model is associated with the paper "GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models" (arXiv:2508.06471), published in 2025 by the GLM Team and numerous collaborators.