Zhipu AI (Z.ai) has announced GLM-4.7-Flash, an addition to its GLM-4.7 model series aimed at developers who want strong performance together with efficient local execution. The company characterizes GLM-4.7-Flash as a 30B-A3B MoE model, positioning it as a highly capable option in the 30-billion-parameter class for lightweight deployments where performance and operational efficiency are critical.
Model Overview and Positioning
GLM-4.7-Flash is a text generation model with 31 billion parameters, published in BF16 and F32 tensor types. Its architecture is designated glm4_moe_lite, and it supports both English and Chinese, optimized for conversational interactions. The new model joins the GLM-4.7 collection alongside its larger counterparts, GLM-4.7 and GLM-4.7-FP8.
Z.ai presents GLM-4.7-Flash as a lightweight, accessible deployment alternative to the full GLM-4.7 model, retaining the series' focus on coding, complex reasoning, and general text generation. That makes it attractive for developers who cannot practically deploy the much larger 358-billion-parameter-class models but still want a modern Mixture-of-Experts design and competitive benchmark results.
Architectural Innovations and Context Handling
The Mixture-of-Experts (MoE) architecture lets GLM-4.7-Flash store far more parameters than it activates for any single token: a router sends each token to a small subset of specialized 'experts', so the effective compute per token stays comparable to that of a smaller dense model.
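To make the routing idea concrete, here is a minimal, toy top-k MoE layer in PyTorch; the expert count, top-k value, and layer sizes below are placeholders for illustration and do not reflect GLM-4.7-Flash's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy top-k Mixture-of-Experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                      # routing logits: (tokens, experts)
        weights, idx = scores.topk(self.k, dim=-1)   # each token keeps only k experts
        weights = F.softmax(weights, dim=-1)         # normalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoELayer(d_model=64, d_ff=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```

The parameters of all eight experts are stored, but each token only pays the cost of the two experts it is routed to, which is the trade-off the "30B-A3B" description points at.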
GLM-4.7-Flash offers a context length of 128,000 tokens, which contributes to its strong showing on coding benchmarks relative to models of similar scale. That capacity matters for large codebases, multi-file repositories, and long technical documentation, scenarios where many models still require aggressive chunking. The model also uses a standard causal language modeling interface and a chat template, so it slots into existing large language model (LLM) stacks with minimal changes.
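A minimal sketch of that workflow with the Transformers library might look like the following; the repository id is an assumption for illustration, so check the official model card for the exact name and recommended loading options.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository id below is an assumption for illustration; consult the
# official model card for the exact name and loading settings.
model_id = "zai-org/GLM-4.7-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Walk me through what this repository's build script does."},
]
# The bundled chat template converts the message list into the model's prompt format.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```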
Performance Benchmarks in the 30B Class
Comparisons run by the Z.ai team pit GLM-4.7-Flash against models such as Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. The results show GLM-4.7-Flash leading or matching them across benchmarks covering mathematics, reasoning, long-horizon tasks, and coding-agent capabilities, underscoring the model's two main strengths: a compact deployment footprint and strong results on established coding and agent-centric evaluations.
Configuration for Optimal Use
For most general tasks, the recommended defaults for GLM-4.7-Flash are a temperature of 1.0, top_p of 0.95, and up to 131,072 new tokens, giving a relatively open sampling regime with a large generation budget. For demanding tasks such as Terminal Bench and SWE-bench Verified, stricter settings are advised: temperature 0.7, top_p 1.0, and at most 16,384 new tokens; τ²-Bench uses a temperature of 0 with the same 16,384-token cap. The reduced randomness matters for tasks that depend on stable tool use and multi-step interaction.
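Expressed in code, the two presets might look like the following sketch using vLLM's offline API; the repository id is an assumption, and the values should be confirmed against the model card.

```python
from vllm import LLM, SamplingParams

# Default settings reported for general tasks.
general = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=131072)

# Stricter settings reported for agentic/coding evaluations such as
# SWE-bench Verified and Terminal Bench (tau^2-Bench drops temperature to 0).
agentic = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=16384)

# The repository id is an assumption for illustration; very large max_tokens
# values also require the engine to be configured with a matching context length.
llm = LLM(model="zai-org/GLM-4.7-Flash")
outputs = llm.generate(["Write a Python function that merges two sorted lists."], agentic)
print(outputs[0].outputs[0].text)
```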
Z.ai also recommends enabling 'Preserved Thinking' mode for multi-turn agentic tasks such as τ²-Bench and Terminal Bench 2. This mode keeps internal reasoning traces across turns, which helps agents that require long sequences of function calls and iterative corrections.
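The article does not spell out how Preserved Thinking is switched on, and the mechanism depends on the serving stack; conceptually it amounts to keeping reasoning content in the running conversation rather than discarding it between turns, as in this deliberately generic sketch (call_model is a hypothetical stand-in, not Z.ai's API).

```python
history = []

def call_model(messages):
    """Hypothetical stand-in for the real chat client (vLLM, SGLang, etc.);
    replace this stub with an actual API call."""
    return {"role": "assistant", "content": "...", "reasoning_content": "..."}

def run_turn(new_message):
    """Run one agent turn while keeping reasoning traces in the history."""
    history.append(new_message)
    assistant = call_model(history)
    # Keep the whole assistant message, including any reasoning/thinking field,
    # rather than stripping it down to the final answer text. Retaining it is
    # what lets later tool calls and corrections build on earlier reasoning.
    history.append(assistant)
    return assistant

print(run_turn({"role": "user", "content": "List the failing tests, then fix them."}))
```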
Developer Advantages
GLM-4.7-Flash consolidates several key properties that are particularly relevant for agentic and code-focused applications:
- A Mixture-of-Experts (MoE) architecture with 31 billion total parameters (30B-A3B) and a 128,000-token context length.
- Strong benchmark results on evaluations such as AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp, relative to comparable models.
- Clearly documented evaluation parameters and a 'Preserved Thinking' mode for managing complex multi-turn agent tasks.
- First-class support for popular inference frameworks, including vLLM, SGLang, and Transformers, complete with ready-to-use deployment commands (see the client-side sketch after this list).
- A growing collection of finetunes and quantizations, including MLX conversions, accessible within the Hugging Face ecosystem.
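To give a flavor of what that framework support looks like in practice, here is a minimal client-side sketch against an OpenAI-compatible endpoint of the kind vLLM or SGLang exposes locally; the port, path, and model name are assumptions for illustration.

```python
from openai import OpenAI

# Port, path, and model name are assumptions; they depend on how the local
# vLLM or SGLang server was launched.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[{"role": "user", "content": "Explain what a Mixture-of-Experts layer is."}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```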
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost