Viral Wire

StepFun's Step 3.7 Flash boosts agentic coding with 11B active parameters and multimodal vision

Open-source 198B MoE model reaches 97% of Claude Opus 4.6's coding performance at roughly one-ninth the per-task cost.

Deep Dive

StepFun's new Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model designed for agentic workflows. It pairs a 196B language backbone with a 1.8B ViT encoder and activates only ~11B parameters per token for efficient inference. Key specs include a 256k-token context window, throughput of up to 400 tokens/sec, and three selectable reasoning depths (low, medium, high) to balance latency and quality. The model improves on coding-agent benchmarks: 56.26% on SWE-Bench Pro (up 5 points from Step 3.5 Flash), 59.55% on Terminal-Bench, and 72.42% on SWE-MTLG.
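To put the sparse-activation figures in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the parameter counts quoted above; the variable and function names are illustrative, and the ~11B active figure is approximate.

```python
# Parameter counts quoted in the article, in billions.
BACKBONE_B = 196.0   # language backbone
VISION_B = 1.8       # ViT encoder
ACTIVE_B = 11.0      # approximate parameters activated per token

TOTAL_B = BACKBONE_B + VISION_B


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that participate in one token's forward pass."""
    return active_b / total_b


if __name__ == "__main__":
    print(f"total parameters: about {TOTAL_B:.1f}B")
    print(f"active per token: about {active_fraction(ACTIVE_B, TOTAL_B):.1%}")
```

In other words, the two components sum to roughly the 198B headline figure, and only about one parameter in eighteen is used for any given token.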

Step 3.7 Flash introduces Advisor Mode, an implementation of Anthropic's advisor strategy in which the model runs the agentic loop end-to-end and escalates to an advisor only for planning or failure recovery. StepFun reports that with Advisor Mode, Step 3.7 Flash reaches 97% of Claude Opus 4.6's coding performance on SWE-Bench Verified at roughly one-ninth the per-task cost ($0.19 vs. $1.76). The model also supports multimodal vision through two tool pathways: a Visual Search Tool for recognition and a Python Tool for fine-grained analysis. StepFun observed emergent compositional tool use, such as generating frontend code and then using GUI rendering to iterate on it, without explicit training for that behavior. On the Android Daily benchmark it scores 61.87%, second only to Gemini 3 Flash.
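The advisor pattern described above can be sketched roughly as follows. This is a minimal illustration, not StepFun's or Anthropic's implementation: `run_with_advisor`, `StepResult`, `RunLog`, and the stub executor and advisor are hypothetical names invented for this example, and the real escalation criteria are not described in the announcement. The last lines check the "one-ninth" cost claim from the quoted per-task prices.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    done: bool
    failed: bool = False
    note: str = ""


@dataclass
class RunLog:
    executor_calls: int = 0
    advisor_calls: int = 0
    finished: bool = False


def run_with_advisor(
    task: str,
    executor: Callable[[str, str], StepResult],
    advisor: Callable[[str, str], str],
    max_steps: int = 10,
) -> RunLog:
    """Cheap executor drives the loop; the advisor is consulted only to plan or recover."""
    log = RunLog()
    guidance = advisor(task, "plan")          # escalation 1: up-front planning
    log.advisor_calls += 1
    for _ in range(max_steps):
        result = executor(task, guidance)     # routine steps stay on the executor
        log.executor_calls += 1
        if result.done:
            log.finished = True
            break
        if result.failed:                     # escalation 2: failure recovery
            guidance = advisor(task, "recover: " + result.note)
            log.advisor_calls += 1
    return log


def make_stub_executor() -> Callable[[str, str], StepResult]:
    """Hypothetical executor: fails on its 2nd step, finishes on its 4th."""
    state = {"n": 0}

    def executor(task: str, guidance: str) -> StepResult:
        state["n"] += 1
        if state["n"] == 2:
            return StepResult(done=False, failed=True, note="tests failed")
        return StepResult(done=state["n"] >= 4)

    return executor


def stub_advisor(task: str, request: str) -> str:
    """Hypothetical advisor: returns canned guidance."""
    return f"guidance for '{task}' ({request})"


log = run_with_advisor("fix failing test", make_stub_executor(), stub_advisor)

# Per-task costs quoted in the article (USD).
COST_RATIO = 1.76 / 0.19

if __name__ == "__main__":
    print(log)
    print(f"cost ratio: about {COST_RATIO:.1f}x")
```

In this toy run the executor handles four steps while the advisor is called twice, once to plan and once after the simulated failure. Keeping the expensive model out of routine steps is the behavior the per-task cost gap is attributed to.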

Key Points
  • 198B MoE model with 11B active parameters, 256k context, and Apache 2.0 license
  • Advisor Mode achieves 97% of Claude Opus 4.6 coding performance on SWE-Bench Verified at $0.19 per task
  • Multimodal with visual search and Python tools; emergent tool composition observed

Why It Matters

An open-source model that delivers near-frontier coding-agent performance at a fraction of the cost makes capable AI coding assistants far more affordable to run.