
DeepSeek-V4 Release: Open Source & Pricing

DeepSeek unveils V4-Pro and V4-Flash, challenging US dominance with massive context and aggressive pricing.

April 24, 2026

Introduction

DeepSeek has officially unveiled its V4 model family, marking a significant milestone in the global AI race. Released on April 24, 2026, this update represents a strategic pivot for the Chinese AI startup, offering high-performance open-source models that rival top-tier closed-source competitors from the US. The release includes two distinct variants: DeepSeek-V4-Pro and DeepSeek-V4-Flash, designed to cater to different deployment needs while maintaining state-of-the-art reasoning capabilities. This announcement intensifies the competition, particularly with its aggressive pricing strategy and massive context window.

The V4 launch follows the V3 model released in December 2024, continuing a trajectory of rapid iteration. By opening the weights on HuggingFace, DeepSeek has committed to a philosophy of accessibility, allowing researchers and engineers to inspect and fine-tune the models. Such transparency is rare in the current market, where most high-performance models remain proprietary, and the release once again challenges the narrative that only US-based entities can produce top-tier LLMs at scale.

  • Released on 2026-04-24
  • Two variants: V4-Pro and V4-Flash
  • Open-source weights available on HuggingFace

Key Features & Architecture

The DeepSeek-V4 architecture leverages a massive Mixture-of-Experts (MoE) structure to balance efficiency and performance. The V4-Pro model uses 1.6T total parameters with 49B active parameters, while V4-Flash operates with 284B total and 13B active parameters. Both models support a context window of 1 million tokens, allowing the processing of extensive documents and codebases, and the output limit extends to 384K tokens, making long-form generation feasible.

Beyond standard generation, both models support a thinking mode (the default) and a non-thinking mode, trading reasoning depth for latency in latency-sensitive applications. The V4-Pro is optimized for complex reasoning, while V4-Flash focuses on speed and cost-efficiency. The models are also compatible with the OpenAI and Anthropic API formats, so they slot into existing workflows without significant refactoring; a minimal request sketch follows the list below.

  • V4-Pro: 1.6T total / 49B active params
  • V4-Flash: 284B total / 13B active params
  • Context window: 1M tokens
  • Output max: 384K tokens
  • Thinking mode (default) and non-thinking mode
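
Because the API is OpenAI-compatible, a first request needs only a changed base URL. A minimal sketch, assuming the openai Python SDK; the model identifier "deepseek-v4-flash" is a guess, since the release notes do not specify the exact API string:

```python
# Minimal sketch of calling DeepSeek-V4 through the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued on the DeepSeek platform
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint from the post
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",            # hypothetical model ID
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the V4 release in one sentence."},
    ],
)
print(response.choices[0].message.content)
```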

Performance & Benchmarks

In terms of performance, DeepSeek claims V4 rivals the best closed-source models globally. While the release notes do not detail specific benchmark scores such as MMLU or HumanEval, the models are noted for enhanced reasoning and autonomous task performance. The V4-Pro specifically targets complex reasoning tasks and shows clear improvements over the V3 model launched in December 2024.

The V4-Pro is reported to be better optimized for China's domestic chips, a significant advantage for enterprises in China and potentially elsewhere. This hardware optimization reduces latency and energy consumption during inference. The models' ability to handle long-context tasks effectively is a major selling point for enterprise applications requiring document retrieval and analysis.

  • Enhanced reasoning capabilities
  • Optimized for domestic chips
  • Rivals best closed-source models
  • Improved over V3 model (Dec 2024)

API Pricing

The pricing strategy is arguably the most disruptive aspect of the release. DeepSeek offers ultra-aggressive rates, with the Flash model at $0.14/M input tokens (cache miss) and $0.028/M (cache hit). Output costs are set at $0.28/M for Flash. For the Pro model, input costs are $1.74/M (cache miss) and $0.145/M (cache hit), with output at $3.48/M. This puts the Flash model at roughly one-seventh the cost of Claude Opus 4.7, which makes it highly attractive for high-volume applications.

Cache hit pricing is a crucial feature for cost management in production environments. The discounted rate applies to input tokens whose prompt prefix is already in the cache, so workloads that repeat a system prompt or shared document context can cut input costs substantially. The pricing is designed to encourage adoption of the Flash model for general tasks while reserving the Pro model for complex, high-value reasoning where the extra cost is justified by output quality; a quick cost calculation follows the list below.

  • Flash Input: $0.14/M (miss), $0.028/M (hit)
  • Flash Output: $0.28/M
  • Pro Input: $1.74/M (miss), $0.145/M (hit)
  • Pro Output: $3.48/M
  • Approx 7x cheaper than Claude Opus 4.7
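
To make the numbers concrete, here is a small sketch that prices a single request from the per-million-token rates above; the token counts are made up for illustration:

```python
# Price a single request against the published V4 per-million-token rates.
# Token counts below are illustrative, not from the release.
PRICES = {  # USD per 1M tokens
    "flash": {"input_miss": 0.14, "input_hit": 0.028, "output": 0.28},
    "pro":   {"input_miss": 1.74, "input_hit": 0.145, "output": 3.48},
}

def request_cost(model: str, hit_tokens: int, miss_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, splitting input into cache hits and misses."""
    p = PRICES[model]
    return (
        hit_tokens * p["input_hit"]
        + miss_tokens * p["input_miss"]
        + output_tokens * p["output"]
    ) / 1_000_000

# Example: 90K cached input, 10K fresh input, 2K output on Flash.
print(f"${request_cost('flash', 90_000, 10_000, 2_000):.6f}")  # ≈ $0.00448
```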

Use Cases

Developers can use these models for a wide range of applications. JSON output support and Tool Calls make them well suited to building complex agents and RAG systems, and the 1M-token context window is particularly valuable for enterprise RAG scenarios where long documents need to be summarized or queried; a tool-call sketch in the OpenAI-compatible format is shown below.
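
For illustration, a minimal tool-call request in the OpenAI-compatible format; the model identifier and the get_weather tool are assumptions for this sketch, not from the release:

```python
# Sketch of a tool-call request via the OpenAI-compatible endpoint.
# The model ID "deepseek-v4-pro" and the get_weather tool are hypothetical.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-v4-pro",  # hypothetical model ID
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
# If the model decides to call the tool, the structured call arrives here:
print(response.choices[0].message.tool_calls)
```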

For coding applications, the FIM Completion (Beta) and Chat Prefix Completion (Beta) features enable advanced code generation and debugging workflows; a FIM sketch follows the list below. While this release is primarily text-focused, the API design leaves room for future integration with vision models. These features are best suited to autonomous agents that require tool interaction and long-term memory retention.

  • Coding and Tool Calls
  • Enterprise RAG
  • Agent development
  • Long-context document analysis
  • Beta features: FIM and Chat Prefix Completion
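
As a sketch of the FIM beta, here is a completion that fills in a function body between a prompt and a suffix. The /beta base URL and suffix parameter follow DeepSeek's existing FIM convention; the V4 model ID is again a placeholder:

```python
# Fill-in-the-middle (FIM) completion sketch, assuming DeepSeek's existing
# FIM beta convention (legacy completions API under /beta, with a suffix).
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com/beta")

response = client.completions.create(
    model="deepseek-v4-flash",         # hypothetical model ID
    prompt="def fibonacci(n):\n",      # code before the hole
    suffix="\nprint(fibonacci(10))",   # code after the hole
    max_tokens=128,
)
print(response.choices[0].text)        # the model's middle section
```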

Getting Started

Accessing the model is straightforward for developers. Weights are available on HuggingFace for open-source experimentation. For production, users can access the API endpoints at https://api.deepseek.com and https://api.deepseek.com/anthropic. SDKs are available to streamline integration, and the model supports Beta features like Chat Prefix Completion and FIM Completion. Documentation is provided to assist with API key management and rate limiting configuration.

To get started, developers should first register on the DeepSeek platform to obtain API keys. Once registered, the SDKs can be installed via npm for Node.js or pip for Python. The HuggingFace repository provides the model weights for local deployment, giving full control over inference hardware. This dual availability ensures flexibility between cloud and on-premise deployments.
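
For teams already on the Anthropic SDK, the second endpoint above works as a drop-in base URL. A minimal sketch, assuming the anthropic Python SDK and a hypothetical deepseek-v4-pro model identifier:

```python
# Sketch of DeepSeek's Anthropic-compatible endpoint with the anthropic SDK.
# The model ID "deepseek-v4-pro" is hypothetical.
import anthropic

client = anthropic.Anthropic(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com/anthropic",  # endpoint from the post
)

message = client.messages.create(
    model="deepseek-v4-pro",  # hypothetical model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
)
print(message.content[0].text)
```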

  • API Endpoints: api.deepseek.com
  • Weights on HuggingFace
  • SDKs available for Python and Node.js
  • Beta features: FIM and Chat Prefix Completion



Sources

  • DeepSeek API Pricing
  • Numerama - DeepSeek V4
  • Le Monde - DeepSeek V4
  • DeepSeek V4 Tech Report (PDF)
  • DeepSeek V4 Open Weights
  • China's DeepSeek releases preview of long-awaited V4 model as AI race intensifies
  • DeepSeek returns with V4-Pro and V4-Flash, a year after its 'Sputnik moment'