StepFun's new Step-3.7-Flash model breaks benchmarks in multimodal reasoning and tool use, offering massive performance at a fraction of the cost.

The landscape of 'Flash' models is shifting. For a long time, developers had to choose between the lightning-fast responsiveness of small models and the deep reasoning capabilities of frontier giants. On May 29, 2026, StepFun shattered this dichotomy with the release of Step-3.7-Flash.
Step-3.7-Flash isn't just another incremental update; it is a native multimodal powerhouse designed specifically for the agentic era. By combining high-speed throughput with sophisticated visual understanding and tool-calling reliability, StepFun has delivered a model that feels less like a chatbot and more like a digital coworker capable of navigating complex UIs and executing code in real-time.
At the heart of Step-3.7-Flash lies a highly efficient Sparse Mixture of Experts (MoE) architecture. While the model boasts a massive 198B total parameters, it only activates approximately 11B parameters per token. This 'intelligence density' is what allows the model to maintain high-level reasoning without the massive computational overhead typically associated with large-scale models.
This architectural efficiency translates directly into developer-friendly metrics. The model achieves a staggering throughput of 400 tokens per second, making it ideal for real-time applications and high-concurrency production environments. Furthermore, it supports a massive 256K context window, which includes three distinct reasoning levels to balance speed and depth depending on the complexity of the task.
The numbers speak for themselves. Step-3.7-Flash has claimed the top spot on several critical benchmarks that test the limits of multimodal and agentic intelligence. It ranks #1 on ClawEval-1.1 with a score of 67.1 and holds the #1 position on SimpleVQA Search with a score of 79.2, proving its ability to interpret visual data and search the web with precision.
For developers focused on automation and coding, the results are even more impressive. The model scored a massive 95.3 on the V* Python benchmark and secured #2 on SWE-PRO with a score of 56.3. Perhaps most importantly for agentic reliability, it achieved over 98% on the ΟΒ²-bench across all difficulty levels, ensuring that when the model calls a tool, it does so with near-perfect accuracy.
Unlike models that use separate vision encoders bolted onto a text backbone, Step-3.7-Flash is natively multimodal. This means it doesn't just 'see' an image; it understands the semantic relationship between UI elements, charts, complex documents, and code. This makes it uniquely capable of tasks like 'Look at this dashboard screenshot and write a Python script to replicate this chart.'
The model also features an advanced web and visual search capability. It can browse multiple sources and perform deep follow-up queries to verify information. This makes it a formidable tool for RAG (Retrieval-Augmented Generation) workflows where accuracy and source verification are paramount.
StepFun has optimized the pricing model to be incredibly competitive for high-volume developers. By utilizing a cache-hit mechanism, developers can drastically reduce costs for repetitive prompts or long-context RAG applications. The combination of high throughput and low cost makes it one of the most economically viable models for scaling agentic services.
The pricing structure is transparent and designed to reward efficient prompt engineering through significant discounts on cache hits.
The versatility of Step-3.7-Flash allows it to serve multiple high-impact domains. In software engineering, its high V* Python and SWE-PRO scores make it a premier choice for automated code review, debugging, and boilerplate generation. In the realm of business intelligence, its ability to parse charts and documents allows for automated reporting and data extraction.
For AI engineers building autonomous agents, the 98%+ reliability on ΟΒ²-bench is the headline feature. You can deploy this model into complex loops where it must interact with external APIs, navigate web interfaces, and make decisions based on visual feedback without constant human intervention.
Ready to integrate Step-3.7-Flash into your stack? Because the weights are released under the Apache 2.0 license, you can host the model locally using frameworks like GGUF or NVIDIA NIM for maximum privacy and control. For those preferring a managed service, the model is available via API through providers like OpenRouter.
For local deployment, check Hugging Face for BF16, FP8, and NVFP4 versions, as well as GGUF quantizations to fit various hardware constraints.
API Pricing β Input: $0.20 / 1M tokens / Output: $1.15 / 1M tokens / Context: 256K