
Zhipu AI Unveils GLM-4.5 Air: The Efficient 106B MoE Powerhouse

Zhipu AI releases GLM-4.5 Air, a lightweight 106B MoE variant optimized for efficient inference on H20 GPUs with open-source weights.

July 28, 2025

Introduction

In the rapidly evolving landscape of large language models, Zhipu AI has once again pushed the boundaries of efficiency and capability with the release of GLM-4.5 Air. Announced on July 28, 2025, the model represents a strategic shift towards democratizing high-performance AI through lower inference costs and open-source accessibility. While the flagship GLM-4.5 weighs in at 355B parameters, the Air variant targets developers who need the intelligence of a 100B+ model without the prohibitive hardware requirements.

This release is particularly significant for startups and enterprises looking to deploy agentic workflows without the overhead of massive clusters. By leveraging a Mixture of Experts (MoE) architecture, GLM-4.5 Air delivers competitive performance while maintaining a lean footprint. It is designed to run efficiently on standard hardware configurations, such as 8x H20 GPUs, making it a viable choice for production environments where cost-per-token is a critical metric.

  • Release Date: July 28, 2025
  • Provider: Zhipu AI
  • License: MIT (Open Source)
  • Target Hardware: 8x H20 GPUs

Key Features & Architecture

The architecture of GLM-4.5 Air is built on a 106B parameter MoE structure, which allows the model to activate only a subset of parameters per token, significantly reducing computational load during inference. This design choice is critical for maintaining low latency while processing complex reasoning tasks. Unlike dense models, the MoE approach ensures that the model remains lightweight without sacrificing the depth of knowledge required for advanced coding and agent tasks.

Developers will appreciate the hybrid reasoning capabilities embedded in the model. It supports both thinking modes for complex problem-solving and non-thinking modes for immediate responses. The model is fully open-source, allowing for fine-tuning on proprietary datasets and deployment on private infrastructure. This flexibility is a key differentiator in the current market where data privacy is paramount.

  • Parameters: 106B MoE
  • Context Window: 128k tokens
  • Max Output: 32k tokens
  • Architecture: Mixture of Experts
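
The core idea behind the MoE design described above, routing each token to only a few experts instead of running every parameter, can be sketched in a few lines. The dimensions below are toy values for illustration, not GLM-4.5 Air's real configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy top-k expert routing, the mechanism behind MoE efficiency.
# Sizes are tiny stand-ins, not GLM-4.5 Air's actual dimensions.
N_EXPERTS = 8   # experts in the layer
TOP_K = 2       # experts activated per token
D_MODEL = 16    # hidden size

# Each "expert" is a simple linear map for illustration.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS))

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router                    # router score per expert
    top = np.argsort(logits)[-TOP_K:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only TOP_K of N_EXPERTS experts do any work for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
out = moe_forward(token)
```

Because only `TOP_K` experts run per token, compute per forward pass scales with the activated subset rather than the full 106B parameter count, which is what keeps inference latency and cost down.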

Performance & Benchmarks

Performance-wise, GLM-4.5 Air has been rigorously tested across industry-standard benchmarks. According to LLM-Stats data, the model achieved an average score of 59.8 across 12 benchmarks, ranking 6th globally in its category. This result is competitive with other open-source models of similar size, showing that the 106B MoE design is sufficient for high-quality generation. The model excels in coding benchmarks, matching the larger GLM-4.5 flagship on many synthetic tasks.

Inference speed is another critical metric. On an 8x H20 GPU setup, the model achieves stable throughput suitable for production API endpoints. The MoE architecture allows for dynamic token routing, ensuring that the model focuses compute resources only where necessary. This efficiency translates to lower costs per token and faster response times compared to dense models of equivalent intelligence.

  • Benchmark Score: 59.8 (12 benchmarks)
  • Inference Hardware: 8x H20 GPUs
  • Coding Performance: Comparable to GLM-4.5
  • Latency: Optimized for low-latency responses

API Pricing

For developers accessing the model via Zhipu's cloud API, pricing is structured to reflect the efficiency of the Air variant. The input cost is set at $0.50 per million tokens, while the output cost is $1.50 per million tokens. This pricing model is significantly more cost-effective than the 355B parameter GLM-4.5 flagship, making it ideal for high-volume applications. Additionally, the open-source nature of the model allows for self-hosting, eliminating API fees for those with sufficient infrastructure.

Zhipu also offers a free tier for developers to test the model's capabilities. This tier allows for limited requests per month, enabling engineers to evaluate performance before committing to a paid plan. The value proposition is clear: access to frontier AI capabilities at a fraction of the cost of proprietary alternatives like GPT-4 or Claude Opus, provided the deployment architecture aligns with the model's requirements.

  • Input Price: $0.50 / 1M tokens
  • Output Price: $1.50 / 1M tokens
  • Free Tier: Available for testing
  • Self-Hosted: MIT License
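
At the published rates ($0.50 input, $1.50 output per million tokens), per-request cost is easy to estimate:

```python
# Estimate Zhipu API cost for GLM-4.5 Air at the published rates.
INPUT_PRICE = 0.50 / 1_000_000   # USD per input token
OUTPUT_PRICE = 1.50 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single API call."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Example: a RAG query with a large retrieved context and a short answer.
cost = request_cost(input_tokens=20_000, output_tokens=1_000)
print(f"${cost:.4f}")  # $0.0115
```

Even context-heavy workloads stay cheap: a 20k-token prompt with a 1k-token answer costs just over a cent, which is what makes the model attractive for high-volume applications.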

Use Cases

GLM-4.5 Air is best suited for applications requiring high intelligence but constrained by budget or hardware. It is particularly effective for AI coding agents that need to run autonomously for extended periods. The model's ability to handle long-running tasks makes it a strong candidate for software development workflows, where it can assist in debugging, code generation, and refactoring without requiring constant human intervention.

Beyond coding, the model is excellent for RAG (Retrieval-Augmented Generation) systems. Its 128k context window allows it to ingest large documentation sets and answer questions accurately. For chatbots and customer support agents, the hybrid reasoning mode ensures that the model can handle both simple queries and complex, multi-step problems efficiently.

  • AI Coding Agents
  • RAG Systems
  • Customer Support Chatbots
  • Long-Document Analysis
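
For the RAG use case, the practical task is fitting retrieved documents into the 128k-token window while reserving room for the 32k-token output. A minimal packing sketch, using a rough 4-characters-per-token heuristic (an assumption, not the model's real tokenizer):

```python
# Pack retrieved documents into GLM-4.5 Air's context window, reserving
# space for the maximum output. CHARS_PER_TOKEN is a crude estimate.
CONTEXT_WINDOW = 128_000
MAX_OUTPUT = 32_000
CHARS_PER_TOKEN = 4

def pack_context(question: str, docs: list[str]) -> str:
    """Greedily add docs (assumed pre-sorted by relevance) until the budget is spent."""
    budget = (CONTEXT_WINDOW - MAX_OUTPUT) * CHARS_PER_TOKEN
    used = len(question)
    selected = []
    for doc in docs:
        if used + len(doc) > budget:
            break
        selected.append(doc)
        used += len(doc)
    context = "\n\n".join(selected)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = pack_context(
    "What hardware does GLM-4.5 Air target?",
    ["Doc A " * 100, "Doc B " * 100],
)
```

A production system would count real tokens with the model's tokenizer, but the budgeting logic is the same: input budget = context window minus the output you intend to allow.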

Getting Started

Accessing GLM-4.5 Air is straightforward for developers familiar with Zhipu's ecosystem. The model weights are available on Hugging Face under the MIT license, allowing for immediate download and local deployment. For API access, developers can sign up at the official Zhipu AI portal to obtain an API key. The SDK supports multiple languages, including Python and JavaScript, facilitating quick integration into existing applications.

Documentation is comprehensive, providing guides on fine-tuning, quantization, and deployment optimization. The official repository includes examples of running the model on 8x H20 GPUs, ensuring that engineers can replicate the recommended inference setup. By leveraging these resources, teams can rapidly prototype and deploy AI solutions using this powerful, open-source model.

  • API Endpoint: Zhipu Cloud
  • SDK: Python, JavaScript
  • Weights: Hugging Face
  • Docs: glm45.org
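
A minimal sketch of calling the model through an OpenAI-compatible chat-completions endpoint. Both the endpoint URL and the `glm-4.5-air` model identifier below are assumptions; confirm the exact values against Zhipu's API documentation:

```python
import json
import urllib.request

# Assumed endpoint; check Zhipu's docs for the real URL.
API_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"

def build_payload(prompt: str) -> dict:
    """Assemble the JSON body for a single-turn chat completion."""
    return {
        "model": "glm-4.5-air",  # assumed model id
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str, api_key: str) -> str:
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Teams self-hosting the open weights can point `API_URL` at their own OpenAI-compatible inference server instead of the cloud endpoint.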

Comparison

API Pricing: Input $0.20 / Output $1.10 / Context: 128k


Sources

GLM-4.5 Official Documentation

GLM-4.5 Air Benchmarks
