zai-org/GLM-4.7-Flash · Hugging Face

2026.01.20
Hugging Face · by 권준호
#LLM#Transformers#MoE#Text Generation#Conversational AI

Key Points

  • GLM-4.7-Flash is introduced as a 30B-A3B Mixture-of-Experts (MoE) model, positioned as the strongest in its class for balancing performance and efficiency.
  • It achieves leading results across benchmarks, outperforming models like Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B on tasks such as GPQA, SWE-bench Verified, and BrowseComp, while remaining competitive on AIME.
  • The model card provides detailed instructions and code examples for lightweight local deployment using popular inference frameworks like vLLM and SGLang.

GLM-4.7-Flash is a 30-billion parameter Mixture-of-Experts (MoE) model, specifically a 30B-A3B architecture. Developed as a lightweight deployment option, it aims to balance performance and efficiency, asserting itself as a leading model in the 30B class. The model supports both English and Chinese languages.

The paper presents comparative performance benchmarks against Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash demonstrates competitive, and in several cases superior, performance across a range of tasks:

  • AIME 25: 91.6 (compared to Qwen3's 85.0 and GPT-OSS's 91.7)
  • GPQA: 75.2 (compared to Qwen3's 73.4 and GPT-OSS's 71.5)
  • LCB v6: 64.0 (compared to Qwen3's 66.0 and GPT-OSS's 61.0)
  • HLE: 14.4 (compared to Qwen3's 9.8 and GPT-OSS's 10.9)
  • SWE-bench Verified: 59.2 (significantly higher than Qwen3's 22.0 and GPT-OSS's 34.0)
  • τ²-Bench: 79.5 (significantly higher than Qwen3's 49.0 and GPT-OSS's 47.7)
  • BrowseComp: 42.8 (significantly higher than Qwen3's 2.29 and GPT-OSS's 28.3)

The core methodology for local deployment and inference of GLM-4.7-Flash involves utilizing optimized inference frameworks such as vLLM and SGLang, alongside the standard Hugging Face Transformers library.

Local Deployment with Hugging Face Transformers:
The model can be loaded and used for text generation via the transformers library. This involves:

  1. Installation: Ensure the latest transformers from the main branch is installed: pip install git+https://github.com/huggingface/transformers.git.
  2. Model and Tokenizer Loading:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "zai-org/GLM-4.7-Flash"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,  # BFloat16 for memory efficiency and speed
    device_map="auto",  # Automatically map model layers to available devices (e.g., GPUs)
)
```
  3. Chat Template Application and Generation:
```python
messages = [{"role": "user", "content": "hello"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,               # Tokenize the input messages
    add_generation_prompt=True,  # Add the generation prompt for the assistant turn
    return_dict=True,
    return_tensors="pt",         # Return PyTorch tensors
)
inputs = inputs.to(model.device)  # Move inputs to the model's device
generated_ids = model.generate(
    **inputs,
    max_new_tokens=128,  # Limit the number of newly generated tokens
    do_sample=False,     # Greedy decoding (no sampling)
)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
```

This method provides a foundational approach for direct inference using the transformers library, leveraging automatic device mapping and bfloat16 precision for efficiency.
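One detail worth noting in the decode step is the slicing expression, which strips the echoed prompt so only the newly generated tokens are decoded. A minimal stand-alone illustration of that indexing, using made-up token IDs so no model is required:

```python
# Made-up token IDs standing in for inputs.input_ids[0] and generated_ids[0].
prompt_ids = [151331, 151333, 98, 99]            # tokens fed to the model
generated_ids = prompt_ids + [1001, 1002, 1003]  # generate() returns prompt + new tokens
# Equivalent to generated_ids[0][inputs.input_ids.shape[1]:] in the example above.
new_tokens = generated_ids[len(prompt_ids):]
print(new_tokens)  # [1001, 1002, 1003]
```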

Optimized Deployment with vLLM:
vLLM is an open-source library for high-throughput and low-latency inference. For GLM-4.7-Flash, vLLM can be set up as a serving endpoint:

  1. Installation: pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly.
  2. Serving Command:
```bash
# Tensor parallelism across 4 GPUs; Multi-Token Prediction (MTP) speculative
# decoding with 1 speculative token; GLM-4.7-format tool-call parsing and
# GLM-4.5-format reasoning parsing; automatic tool choice.
vllm serve zai-org/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --served-model-name glm-4.7-flash
```

These parameters enable efficient serving with tensor parallelism, speculative decoding, and specialized parsing for tool calls and reasoning, enhancing the model's agentic capabilities.
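Once the server is running, vLLM exposes an OpenAI-compatible HTTP API. The sketch below only builds the JSON request body for the /v1/chat/completions endpoint, assuming the default port 8000 and the --served-model-name from the command above; actually sending it requires the live server:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint.
# "glm-4.7-flash" must match the --served-model-name passed to vllm serve.
payload = {
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 128,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions
# with the header Content-Type: application/json.
print(body)
```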

Optimized Deployment with SGLang:
SGLang is another framework for efficient LLM serving, particularly for structured generation.

  1. Installation: uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/.
  2. Serving Command:
```bash
# Tensor parallelism across 4 GPUs; EAGLE speculative decoding (3 decoding
# steps, top-k 1, 4 draft tokens); 80% of memory allocated statically;
# server bound to all interfaces on port 8000.
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-Flash \
    --tp-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.8 \
    --served-model-name glm-4.7-flash \
    --host 0.0.0.0 \
    --port 8000
```

For Blackwell GPUs, additional arguments --attention-backend triton --speculative-draft-attention-backend triton are recommended for enhanced attention performance. SGLang's configuration emphasizes speculative decoding using the EAGLE algorithm and fine-grained control over memory allocation and serving parameters.
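Like vLLM, SGLang serves an OpenAI-compatible API; with the launch command above it listens on 0.0.0.0:8000. A hedged stdlib-only sketch that constructs (but does not send) a chat-completion request against that endpoint:

```python
import json
import urllib.request

# The model name matches --served-model-name; host/port match the launch command.
payload = json.dumps({
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) sends the request once the server is up.
print(req.full_url)
```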

The model is associated with the paper "GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models" (arXiv:2508.06471), published in 2025 by the GLM Team and numerous collaborators.