Google DeepMind releases Gemma 2. 9B and 27B models outperform 2x larger competitors. Apache 2.0 license. Build better AI apps today.

Google DeepMind has officially released Gemma 2, marking a significant milestone in the open-source AI landscape. Released on June 27, 2024, this new family of models represents a substantial leap forward in efficiency and capability compared to its predecessor. For developers and engineers, this means access to state-of-the-art intelligence without the restrictive licensing terms often associated with proprietary large language models.
The release is particularly notable because Gemma 2 achieves performance levels that rival closed-source models significantly larger than itself. By leveraging advanced knowledge distillation techniques from Google's Gemini ecosystem, Gemma 2 delivers frontier AI capabilities on a single GPU, making it a viable option for local deployment and edge computing scenarios. This democratization of high-performance AI is expected to accelerate innovation across the developer community.
Gemma 2 introduces two primary parameter sizes: 9 billion and 27 billion. The 27B model utilizes a Mixture of Experts (MoE) architecture, activating only a fraction of its parameters during inference to reduce computational load. This architectural choice allows the model to maintain high performance while consuming fewer resources compared to dense models of similar scale.
A core differentiator for Gemma 2 is its training methodology. The models undergo knowledge distillation from Gemini, Google's proprietary large language model. This process transfers complex reasoning patterns and linguistic nuances into the open weights, ensuring Gemma 2 performs robustly on complex tasks. The models support a context window of up to 131,072 tokens, enabling long-form document analysis and extended conversation history without losing coherence.
In independent evaluations, Gemma 2 consistently outperforms models twice its size. On the MMLU benchmark, the 27B model achieves scores comparable to much larger proprietary models, demonstrating superior reasoning capabilities. The HumanEval benchmark shows significant gains in code generation tasks, validating its utility for software engineering workflows.
Specific benchmark scores highlight the efficiency gains. On the SWE-bench (Software Engineering) leaderboard, Gemma 2 shows marked improvements in solving real-world GitHub issues. These metrics confirm that the distillation from Gemini was successful in transferring high-level reasoning skills. Developers can expect reliable performance on tasks ranging from mathematical problem solving to creative writing.
As an open-source model under the Apache 2.0 license, Gemma 2 does not carry a direct API subscription fee from Google. Developers can download the weights and host the model on their own infrastructure, effectively eliminating per-token input and output costs. This is a critical value proposition for startups and enterprises looking to avoid vendor lock-in and control their inference costs.
However, if accessed via Google Cloud Vertex AI, standard pricing rules for GPU compute apply. For self-hosted deployments using the open weights, the marginal cost is limited to your cloud infrastructure expenses. This structure allows for flexible scaling, where you pay only for the compute resources you utilize, rather than a flat API rate.
When evaluating Gemma 2 against direct competitors, the value proposition becomes clear. While Llama 3 70B offers broad knowledge, Gemma 2 27B provides comparable reasoning with a smaller footprint. Mixtral 8x7B is efficient but lacks the long-context capabilities of Gemma 2. The comparison below details the technical specifications that matter most for production deployment.
Gemma 2 is exceptionally well-suited for applications requiring advanced reasoning and agentic workflows. In coding environments, the model's ability to understand complex logic makes it ideal for pair programming and code refactoring. For Retrieval-Augmented Generation (RAG) systems, the extended context window allows for more accurate information synthesis from large document repositories.
Agentic workflows benefit significantly from the model's instruction-following capabilities. Developers can build autonomous agents that plan and execute multi-step tasks with high reliability. Additionally, the model's multimodal data processing support facilitates integration into systems that require handling diverse data types, from text to structured data.
Accessing Gemma 2 is straightforward for developers familiar with standard ML pipelines. The weights are available on Hugging Face and can be loaded using the Hugging Face Transformers library. For production environments, Google provides optimized inference pipelines on Vertex AI that reduce latency and improve throughput.
To begin, clone the official repository and download the model weights. Ensure your environment has PyTorch installed and sufficient GPU memory for the 27B variant. The documentation includes examples for fine-tuning on custom datasets, allowing teams to adapt the model to specific domain requirements quickly.
API Pricing β Input: Free (Open Weights) / Output: Free (Open Weights) / Context: 8192 tokens (Base), 131072 tokens (Extended)