GLM-5: From Vibe Coding to Agentic Engineering
Key Points
- GLM-5 is a new large language model scaling to 744B parameters and 28.5T pre-training tokens, integrating DeepSeek Sparse Attention and a novel `slime` RL infrastructure for improved efficiency.
- Designed for complex systems engineering and long-horizon agentic tasks, GLM-5 demonstrates significant performance improvements across a wide range of academic benchmarks.
- It achieves best-in-class performance among open-source models in reasoning, coding, and agentic tasks, closing the gap with frontier models, and is available open-source as well as via APIs.
GLM-5 is a newly launched large language model targeting complex systems engineering and long-horizon agentic tasks, representing an advancement over its predecessor, GLM-4.7, and aiming to narrow the gap with frontier models.
Core Methodology and Architecture:
GLM-5 leverages scaling as a primary method for intelligence improvement. It significantly increases its parameter count from 355B (32B active) in GLM-4.5 to 744B (40B active). The pre-training dataset has also been expanded from 23T to 28.5T tokens. A key architectural enhancement is the integration of DeepSeek Sparse Attention (DSA), which is designed to substantially reduce deployment costs while preserving long-context capacity. This sparse attention mechanism likely optimizes computational efficiency by focusing attention on relevant token subsets within extended contexts, contrasting with dense attention mechanisms that incur quadratic computational complexity with context length.
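The dense-versus-sparse trade-off described above can be sketched in a few lines. The following is an illustrative top-k key-selection mechanism, not DSA's actual algorithm; the function names and the selection rule are assumptions for demonstration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_row(q, K, V, top_k=None):
    """Attention output for one query vector q over keys K and values V.
    With top_k set, only the k highest-scoring keys participate (an
    illustrative sparse selection); otherwise attention is dense, which
    costs O(n) per query and O(n^2) over a full length-n sequence."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    if top_k is not None and top_k < len(scores):
        # Keep only scores at or above the k-th largest; mask out the rest
        # so they receive zero weight after the softmax.
        cutoff = sorted(scores, reverse=True)[top_k - 1]
        scores = [s if s >= cutoff else float("-inf") for s in scores]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
```

With `top_k` equal to the number of keys, the sparse path reduces exactly to dense attention; with a small `top_k`, each query mixes only its most relevant values, which is the efficiency intuition behind restricting attention to relevant token subsets in long contexts.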
Training and Post-Training Innovations:
To bridge the gap between initial model competence and refined excellence, GLM-5 incorporates advanced reinforcement learning (RL) techniques. Recognizing the inherent inefficiency of deploying RL at scale for large language models, the developers introduced slime, a novel asynchronous RL infrastructure. This infrastructure is specifically engineered to significantly improve RL training throughput and efficiency. The asynchronous nature allows for parallelization of RL training components, reducing wall-clock time and resource contention, thereby enabling more fine-grained and frequent post-training iterations. This is crucial for optimizing model behavior, particularly for complex, long-horizon tasks where subtle behavioral nuances are critical.
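The throughput benefit of decoupling rollout generation from policy updates can be illustrated with a toy producer/consumer loop. This is a generic sketch of asynchronous RL plumbing under stated assumptions, not slime's actual API; every name below is hypothetical.

```python
import queue
import threading

def rollout_worker(policy_version, out_q, n_episodes):
    """Generation side: streams trajectories continuously instead of waiting
    for each policy update, so neither side idles (the core idea behind
    asynchronous RL training). Records the policy version it sampled under,
    since asynchrony implies some trajectories are slightly stale."""
    for ep in range(n_episodes):
        out_q.put({"episode": ep, "policy_version": policy_version[0]})
    out_q.put(None)  # sentinel: this worker is done

def learner(in_q, policy_version, batch_size=4):
    """Training side: consumes trajectories as they arrive and bumps the
    policy version after each batch, rather than running a synchronous
    generate-then-train lockstep."""
    processed, batch = 0, []
    while True:
        item = in_q.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            policy_version[0] += 1  # stand-in for a gradient update
            processed += len(batch)
            batch.clear()
    return processed + len(batch)

q_ = queue.Queue(maxsize=8)       # bounded buffer between the two sides
version = [0]                     # shared policy version, mutated by the learner
producer = threading.Thread(target=rollout_worker, args=(version, q_, 12))
producer.start()
total = learner(q_, version)
producer.join()
```

The bounded queue is the design choice of interest: it lets generation run ahead of training by a limited margin, trading a small amount of off-policy staleness for much higher hardware utilization.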
Performance and Benchmarking:
GLM-5 demonstrates significant performance improvements across a wide range of academic benchmarks and achieves best-in-class performance among open-source models in reasoning, coding, and agentic tasks.
- Internal Evaluation: On the internal CC-Bench-V2 suite, GLM-5 substantially outperforms GLM-4.7 across frontend, backend, and long-horizon tasks, approaching the performance of Claude Opus 4.5.
- Long-Term Operational Capability: On Vending Bench 2, a benchmark assessing long-term operational capabilities, GLM-5 ranks as the top open-source model, concluding a simulated one-year vending machine business with a final account balance of $4,967.06, compared with Gemini 3.0 Pro's $5,478.16.
- Reasoning: GLM-5 scores 30.5 on Humanity's Last Exam (HLE) and 50.4 with tools, showing strong performance compared to peers, though slightly trailing frontier models like GPT-5.2 (xhigh) and Kimi K2.5 on HLE. For tasks like HMMT Nov. 2025 and IMOAnswerBench, it often exceeds or closely matches other models.
- Coding: On SWE-bench Verified, GLM-5 achieves 77.8%, and on SWE-bench Multilingual it reaches 73.3%. Terminal-Bench 2.0 shows strong performance under both the Terminus-2 and Claude Code harnesses (e.g., 56.2% / 60.7% on Terminus-2). On CyberGym, which evaluates cybersecurity tasks, it scores 43.2%.
- General Agentic Tasks: GLM-5 scores 62.0% on BrowseComp and 75.9% with context management. It achieves 89.7% on -Bench and 67.8% on MCP-Atlas Public Set, and 38.0% on Tool-Decathlon.
Key Capabilities and Applications:
GLM-5 is designed to transition foundational models from conversational interfaces ("chat") to practical "work" tools, akin to office applications for knowledge workers. It can directly convert text or source materials into structured document formats such as .docx, .pdf, and .xlsx (e.g., PRDs, lesson plans, financial reports). The official application, Z.ai, integrates an Agent mode with these built-in skills, supporting multi-turn collaboration and producing real, ready-to-use deliverables.
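As a sketch of how such a document-generation request might look through an OpenAI-compatible chat API, the payload could be built as below. The endpoint URL, the model id `glm-5`, and the prompt wording are assumptions for illustration, not a documented Z.ai schema; only payload construction is shown, and no network call is made.

```python
import json

# Hypothetical OpenAI-compatible endpoint; the exact path is an assumption.
BASE_URL = "https://api.z.ai/v1/chat/completions"

def build_doc_request(source_text, deliverable="PRD (.docx)"):
    """Constructs a chat-completion request body asking the agent to turn
    source material into a structured, ready-to-use document."""
    return {
        "model": "glm-5",  # assumed model id
        "messages": [
            {
                "role": "system",
                "content": f"You are an agent that converts source material into a ready-to-use {deliverable}.",
            },
            {"role": "user", "content": source_text},
        ],
        "temperature": 0.2,  # low temperature for a deterministic deliverable
    }

payload = build_doc_request("Q3 sales notes: revenue up 12%, churn down 1 point.")
body = json.dumps(payload)  # what would be POSTed to BASE_URL
```

In Agent mode the same kind of request would be handled with the built-in document skills, returning a file rather than plain text.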
Accessibility and Deployment:
GLM-5 is open-sourced, with model weights available on Hugging Face and ModelScope under the MIT License. It is also accessible via developer platforms like api.z.ai and BigModel.cn, with compatibility for existing agent frameworks such as Claude Code and OpenClaw. For local deployment, GLM-5 supports inference frameworks including vLLM and SGLang, with comprehensive instructions on its GitHub repository. A notable feature is its support for non-NVIDIA hardware, including Huawei Ascend, Moore Threads, Cambricon, Kunlun Chip, MetaX, Enflame, and Hygon, achieving reasonable throughput through kernel optimization and model quantization. GLM-5 is also available for direct interaction through Z.ai in both Chat Mode (instant response) and Agent Mode (tool-equipped, delivering results).
Evaluation Details (Footnotes):
The evaluation protocols use benchmark-specific settings to ensure rigor and reproducibility. For instance, Humanity's Last Exam (HLE) uses a maximum generation length of 131,072 tokens, with GPT-5.2 (medium) serving as the judge model; HLE-with-tools uses a maximum context length of 202,752 tokens. SWE-bench evaluations use OpenHands with a 200K context window. Terminal-Bench 2.0 (Terminus 2) uses a 2-hour timeout and a 128K context window, with resource limits of 16 CPUs and 32 GB RAM. CyberGym evaluations are performed in Claude Code (think mode, no web tools) with a 250-minute timeout per task, reporting Pass@1 over 1,507 tasks. MCP-Atlas uses Gemini 3 Pro as the judge model.