GLM-4.7-Flash 모델 공개 | GeekNews

xguru · 2026.01.23 · News · by 배레온 (Busan, developer)
#LLM #AI #Open Source #Model #Flash

Key Points

  • GLM-4.7-Flash is a new 30B-A3B Mixture-of-Experts (MoE) model engineered for lightweight deployment, balancing strong performance with efficiency across a range of tasks.
  • It posts competitive benchmark results on AIME 25, GPQA, and SWE-bench, positioning it favorably among 30B-class models, especially for coding, reasoning, and generation.
  • The model supports efficient local deployment via frameworks such as vLLM and SGLang, including quantized builds for consumer hardware, making advanced AI more accessible despite mixed user feedback on real-world quality versus top-tier models.

GLM-4.7-Flash is a large language model built on a 30B-A3B Mixture-of-Experts (MoE) architecture: 30 billion total parameters, of which 3 billion are active per token (the "A3B" in the name). This design aims to balance performance and efficiency, making the model suitable for lightweight deployment. The core idea of MoE is to achieve large capacity (total parameters) while keeping inference fast and compute cheap, because only a subset of experts (the active parameters) processes any given input. A router network dynamically selects which of the many expert sub-networks handle each input token, producing sparse activation.
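As a toy illustration of that routing step (a minimal sketch in plain Python, not GLM's actual router), a MoE layer can score all experts, keep only the top-k, and run just those, so compute scales with active rather than total parameters:

```python
import math
import random

random.seed(0)
n_experts, top_k, d_model = 8, 2, 4  # toy sizes, not GLM's real dimensions

def matvec(w, x):
    # Multiply matrix w (list of rows) by vector x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def rand_mat(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

router = rand_mat(n_experts, d_model)             # one score row per expert
experts = [rand_mat(d_model, d_model) for _ in range(n_experts)]

def moe_layer(token):
    scores = softmax(matvec(router, token))       # router scores every expert
    chosen = sorted(range(n_experts), key=scores.__getitem__)[-top_k:]
    out = [0.0] * d_model
    for i in chosen:                              # only the top-k experts execute
        y = matvec(experts[i], token)
        out = [o + scores[i] * yi for o, yi in zip(out, y)]
    return out, chosen

out, chosen = moe_layer([random.gauss(0, 1) for _ in range(d_model)])
print(len(chosen), len(out))  # 2 4: two experts ran, output stays d_model-sized
```

In a real model the un-chosen experts' weights still sit in memory; sparsity saves compute per token, not total parameter count.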

The model is positioned as a competitive option among 30B-class models, aiming for state-of-the-art performance. It demonstrates strong results across various benchmarks:

  • AIME 25: 91.6 (compared to Qwen3-30B-A3B-Thinking-2507 at 85.0 and GPT-OSS-20B at 91.7)
  • GPQA: 75.2 (noted as higher than comparison models)
  • LCB v6: 64.0
  • HLE: 14.4
  • SWE-bench Verified: 59.2 (a notable outlier among comparison models, even surpassing Qwen3-Coder 480B's 55.4, though concerns were raised about SWE-bench Verified's reliability due to data memorization)
  • τ²-Bench: 79.5
  • BrowseComp: 42.8

GLM-4.7-Flash supports inference frameworks such as vLLM and SGLang (main branch only), emphasizing efficient local deployment. Local execution typically requires approximately 24GB of VRAM, or 32GB of RAM on macOS. Because of the MoE structure, only the 3B active parameters participate in any one forward step, which potentially allows optimizations that keep only frequently used experts resident in VRAM. A 128k context window is supported when VRAM is sufficient.
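The efficiency gain from sparse activation can be made concrete with back-of-envelope arithmetic (using the common ~2 FLOPs-per-parameter-per-token rule of thumb; illustrative numbers, not official figures):

```python
# Rough compute per generated token, assuming ~2 FLOPs per parameter
# touched (a standard rule of thumb, not a vendor-published figure).
total_params = 30e9   # 30B total parameters
active_params = 3e9   # 3B active parameters (the "A3B")

dense_flops = 2 * total_params   # if every parameter were active each token
moe_flops = 2 * active_params    # only the routed experts actually run

print(dense_flops / moe_flops)   # 10.0: ~10x less compute per token
```

This is why a 30B-A3B model can approach the latency of a much smaller dense model while retaining 30B-scale capacity.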

The model is available in quantized formats, particularly GGUF (e.g., 4-bit Q4_K_M quantization), enabling deployment via tools like llama.cpp and ollama, or through interfaces like LM Studio. It is specifically highlighted for low latency and high throughput in coding, reasoning, and generation tasks, as well as strong capabilities in translation, role-playing, and aesthetic generation.
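For a sense of scale, here is a rough file-size estimate for a 4-bit GGUF build (assuming roughly 4.5 effective bits per weight for a Q4_K_M-style mix; the exact ratio varies by tensor and is an assumption here, not a published spec):

```python
# Back-of-envelope GGUF size estimate for a 30B-parameter model.
total_params = 30e9
bits_per_weight = 4.5  # assumed average for a 4-bit K-quant mix

size_gb = total_params * bits_per_weight / 8 / 1e9
print(size_gb)  # 16.875 GB of weights before metadata and KV-cache overhead
```

Weights alone land well under the ~24GB VRAM figure above; the remaining headroom goes to the KV cache, which grows with context length.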

User feedback indicates a strong price-to-performance ratio, particularly for general tasks and coding. While benchmarks suggest high performance, some users found its real-world behavior, especially instruction following, not on par with models like Claude Sonnet or Opus. Still, it is seen as a solid incremental improvement and a viable self-hosting option that significantly reduces LLM-as-a-service costs. The "-Flash" line is noted to have skipped a 4.6 release, and the model is described as roughly on par with Anthropic's Haiku.