NVIDIA Unveils Nemotron Ultra: The 253B MoE Reasoning Powerhouse
NVIDIA has released Nemotron Ultra, an open-source 253B MoE model designed for complex reasoning and enterprise workloads.

Introduction
On March 18, 2025, NVIDIA officially launched Nemotron Ultra, marking a significant milestone in the evolution of open-source reasoning models. This release addresses the critical need for high-performance AI infrastructure capable of handling complex logical tasks without sacrificing transparency. Built upon the robust foundation of the Llama architecture, Nemotron Ultra is engineered to deliver enterprise-grade reliability while maintaining the flexibility required for rapid deployment in production environments.
The significance of this release lies in its ability to bridge the gap between massive parameter counts and practical inference efficiency. Unlike previous iterations that struggled with latency, Nemotron Ultra leverages advanced Mixture of Experts technology to activate only the necessary parameters during inference. This ensures that developers can deploy sophisticated reasoning capabilities across various hardware tiers, from cloud instances to on-premise clusters, making it a versatile tool for modern AI engineering teams.
- Release Date: March 18, 2025
- Architecture: Llama-based Foundation
- Status: Open Source Weights Available
Key Features & Architecture
Nemotron Ultra packs 253 billion parameters, structured as a Mixture of Experts (MoE) model. This architecture routes each token to a small subset of expert sub-networks, cutting inference compute and latency relative to a dense model of similar size (a minimal routing sketch follows the feature list below). The model supports a context window of up to 256,000 tokens, enabling it to process extensive documentation and long-form reasoning tasks with precision.
Multimodal capabilities are integrated directly into the reasoning pipeline, allowing the model to interpret code, mathematical proofs, and natural language instructions simultaneously. NVIDIA has optimized the model specifically for enterprise tasks, ensuring that security and compliance standards are met out of the box. The open-source nature of the weights encourages community-driven improvements and fine-tuning for specialized industry applications.
- Parameters: 253B Total, MoE Architecture
- Context Window: 256k Tokens
- Base Model: Llama-based
- Multimodal Support: Enabled
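To make the routing idea concrete, below is a minimal sketch of top-k expert routing, the mechanism an MoE layer uses to activate only a fraction of its parameters per token. The layer sizes, expert count, and top-2 selection are illustrative defaults, not Nemotron Ultra's published configuration.

```python
# Minimal top-k MoE routing sketch (illustrative sizes, not Nemotron Ultra's config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores each token against every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)    # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)          # normalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # only k experts ever run for a given token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e              # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([10, 64])
```

Because only k experts run per token, per-token compute scales with k rather than with the total parameter count, which is where the efficiency claim above comes from.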
Performance & Benchmarks
In independent evaluations, Nemotron Ultra demonstrated exceptional performance on reasoning-heavy benchmarks. It achieved a score of 88.5% on MMLU, surpassing many proprietary models in its class. On HumanEval, a standard benchmark for code generation, it scored 92.1%, indicating high proficiency in software development tasks. These results validate the model's ability to handle the complexity of modern software engineering and data analysis workflows.
The model also excelled in mathematical and algorithmic reasoning, securing gold medal-level performance, comparable to Nemotron-Cascade 2, on problems from the 2025 International Mathematical Olympiad (IMO) and International Olympiad in Informatics (IOI). This consistency across diverse domains shows that the 253B parameter count translates into tangible capability rather than just scale. Developers can use these benchmark scores as a reference point when planning deployments for critical infrastructure projects.
- MMLU Score: 88.5%
- HumanEval Score: 92.1%
- Math Reasoning: Gold Medal Level
- SWE-bench: Top 1%
API Pricing
For developers accessing Nemotron Ultra via the NVIDIA Cloud API, pricing is structured to balance performance with cost efficiency. Input tokens are priced at $3.00 per million tokens, while output tokens cost $12.00 per million. This pricing model reflects the high computational cost of running 253B parameter models on NVIDIA Blackwell GPUs. Despite the higher output cost, the reduced inference latency compared to competitors offers significant value for time-sensitive applications.
A free tier is available for research and prototyping, allowing users to test up to 100,000 tokens per month without charge. This tier is ideal for evaluating the model's capabilities before committing to an enterprise-scale contract. For high-volume users, volume discounts are available through the NVIDIA AI Enterprise platform, supporting large-scale operations. A quick cost estimate for a typical request is sketched after the list below.
- Input Price: $3.00 / 1M tokens
- Output Price: $12.00 / 1M tokens
- Free Tier: 100k tokens/month
- Discounts: Available for Enterprise
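As a quick sanity check on these prices, the snippet below estimates the cost of a single request; the 200k-input / 2k-output example is illustrative.

```python
# Back-of-the-envelope cost estimate at the list prices quoted above.
INPUT_PER_M = 3.00    # USD per 1M input tokens
OUTPUT_PER_M = 12.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the request cost in USD for the given token counts."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Example: a long-context request with 200k input tokens and a 2k-token answer
print(f"${estimate_cost(200_000, 2_000):.3f}")  # -> $0.624
```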
Comparison Table
When comparing Nemotron Ultra against current market leaders, the advantages in reasoning and context handling become clear. While Llama 3.1 offers strong general capabilities, Nemotron Ultra's specialized MoE architecture provides superior performance in logical tasks. Qwen 2.5 remains competitive in multilingual support, but Ultra excels in English-centric reasoning and coding. The comparison below highlights the key differentiators for each model in the current ecosystem.
- Competitive Analysis: Ultra leads in reasoning
- Llama 3.1: Strong generalist
- Qwen 2.5: Multilingual leader
Use Cases
Nemotron Ultra is best suited for applications requiring deep logical processing and code generation. Software engineering teams can utilize it for automated refactoring, bug detection, and complex system design. In the realm of RAG (Retrieval-Augmented Generation), the model's 256k context window allows it to ingest entire codebases or technical manuals, providing accurate answers based on proprietary data. Financial institutions can leverage it for risk analysis and compliance auditing due to its high accuracy in structured reasoning tasks.
Additionally, Nemotron Ultra can power autonomous agents that perform multi-step planning tasks. Its ability to maintain context over long interactions makes it well suited to customer support bots that need to recall earlier conversation history, and the open-source weights let developers fine-tune the model on domain-specific knowledge. A minimal sketch of the long-context RAG pattern follows the list below.
- Coding & Refactoring
- Enterprise RAG Systems
- Financial Risk Analysis
- Autonomous Agent Planning
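Here is the long-context RAG sketch referenced above: with a 256k-token window, retrieval can be as coarse as selecting whole files and concatenating them into the prompt. The chars-divided-by-four token estimate is a rough heuristic rather than the model's actual tokenizer, and the directory layout is hypothetical.

```python
# Long-context RAG sketch: stuff whole documents into the prompt up to a token budget.
from pathlib import Path

def build_long_context_prompt(question: str, doc_dir: str, token_budget: int = 200_000) -> str:
    chunks, used = [], 0
    for path in sorted(Path(doc_dir).glob("**/*.md")):
        text = path.read_text(encoding="utf-8")
        approx_tokens = len(text) // 4            # crude estimate; swap in the real tokenizer
        if used + approx_tokens > token_budget:
            break                                  # stay comfortably inside the context window
        chunks.append(f"### {path}\n{text}")
        used += approx_tokens
    context = "\n\n".join(chunks)
    return (
        "Answer strictly from the documents below and cite the file name for each claim.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_long_context_prompt("How are API keys rotated?", "./docs")
```

In production you would replace the size heuristic with the model's tokenizer and rank documents by relevance, but the shape of the prompt stays the same.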
Getting Started
Accessing Nemotron Ultra is straightforward for developers familiar with the NVIDIA ecosystem. Weights are available on Hugging Face and the NVIDIA NGC Container Registry. To start using the API, developers can sign up for the NVIDIA Cloud platform and select the Nemotron Ultra endpoint. Python SDKs are provided to simplify integration into existing pipelines, with examples available in the official documentation.
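The snippet below shows what a hosted call could look like, assuming the endpoint exposes an OpenAI-compatible interface (common for NVIDIA-hosted models, but confirm against the official docs). The base URL, environment variable, and model id are placeholders, not confirmed identifiers.

```python
# Hedged example of calling a hosted Nemotron Ultra endpoint via an
# OpenAI-compatible client. Base URL, API key variable, and model id are placeholders.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # placeholder endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # placeholder env var
)

response = client.chat.completions.create(
    model="nvidia/nemotron-ultra",                   # placeholder model id
    messages=[
        {"role": "system", "content": "You are a precise reasoning assistant."},
        {"role": "user", "content": "Outline a refactoring plan for a 2,000-line service module."},
    ],
    temperature=0.2,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```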
For local deployment, the model requires a high-memory GPU cluster, preferably on NVIDIA Blackwell hardware for optimal performance. The documentation includes detailed guides on quantization techniques that reduce the memory footprint without a significant loss in accuracy; a quantized-loading sketch follows the list below. Community forums are active, providing support for troubleshooting deployment issues and sharing fine-tuning strategies.
- Platform: NVIDIA Cloud / Hugging Face
- SDK: Python SDK v2.0
- Deployment: Blackwell GPUs Recommended
- Docs: NVIDIA AI Enterprise
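For the local path, the sketch below follows the usual Hugging Face pattern for loading a checkpoint with 4-bit weight quantization to shrink the memory footprint. The repository id is a placeholder, and a 253B-parameter model still needs a multi-GPU node even when quantized, so treat this as the shape of the workflow rather than a tested recipe for this specific checkpoint.

```python
# Quantized local load sketch (placeholder repo id; multi-GPU node still required).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "nvidia/nemotron-ultra"  # placeholder; use the id published on Hugging Face

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit NF4 weights via bitsandbytes
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # keep compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=quant_config,
    device_map="auto",                      # shard layers across available GPUs
)

inputs = tokenizer("Explain mixture-of-experts routing in two sentences.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```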
Comparison
| Model | Context | Max Output | Input $/1M | Output $/1M | Strength |
| --- | --- | --- | --- | --- | --- |
| Nemotron Ultra | 256k | 16k | $3.00 | $12.00 | Reasoning & Coding |
| Llama 3.1 405B | 128k | 8k | $5.00 | $15.00 | General Knowledge |
| Qwen 2.5 72B | 128k | 8k | $2.00 | $8.00 | Multilingual Support |
| Nemotron-Cascade 2 | 128k | 4k | $4.00 | $14.00 | Math & Logic |