Developer Tools

Mercury 2: The fastest reasoning LLM, powered by diffusion

New diffusion architecture enables >5x faster generation, challenging autoregressive LLM bottlenecks for real-time AI.

Deep Dive

Inception has introduced Mercury 2, a new class of language model that replaces traditional autoregressive decoding with a diffusion-based architecture for parallel token generation. Rather than emitting one token at a time, the model refines many tokens simultaneously across a series of denoising steps, achieving over 1,000 tokens per second on NVIDIA Blackwell GPUs, more than 5x faster than sequential models. The announcement positions Mercury 2 as a solution for modern "production AI," where latency compounds across agentic loops, retrieval pipelines, and interactive applications, making speed a critical bottleneck for user experience.
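
Inception has not published Mercury 2's internals, so the following is only a toy sketch of the general masked-diffusion decoding idea the paragraph describes: every position starts masked, and each refinement step commits several tokens in parallel rather than one per forward pass. The names here (`toy_denoiser`, `commit_frac`, the vocabulary) are illustrative stand-ins, not Inception's code.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for the real model: in one parallel forward pass,
    propose a token plus a confidence score for every masked slot."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=8, steps=4, commit_frac=0.5):
    """Start fully masked, then refine: each step commits the most
    confident proposals, so several tokens land per model call
    instead of exactly one, as in autoregressive decoding."""
    tokens = [MASK] * length
    for _ in range(steps):
        proposals = toy_denoiser(tokens)
        if not proposals:
            break
        k = max(1, int(len(proposals) * commit_frac))
        for i, (tok, _conf) in sorted(proposals.items(),
                                      key=lambda kv: -kv[1][1])[:k]:
            tokens[i] = tok
    # Commit whatever the final pass proposes for leftover masks.
    for i, (tok, _conf) in toy_denoiser(tokens).items():
        tokens[i] = tok
    return tokens

print(" ".join(diffusion_decode()))
```

The speedup claim follows from this shape: a fixed number of refinement passes can fill a whole sequence, whereas an autoregressive decoder needs one pass per token.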

Technically, Mercury 2 offers a 128K context window, native tool use, and schema-aligned JSON output at competitive pricing. That throughput shifts the trade-off between reasoning quality and latency, making complex reasoning feasible within real-time budgets. Early adopters like Zed, Skyvern, and Happyverse AI report transformative impacts on coding autocomplete, agent workflows, and voice interfaces; Skyvern's CTO notes it is "at least twice as fast as GPT-5.2." The model's architecture, optimized for p95 latency under high concurrency, suggests a new direction for LLM development focused on parallel rather than sequential processing.
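
To make the feature list concrete, here is a hedged sketch of what a call with schema-aligned JSON output might look like through an OpenAI-compatible client. The base URL, model identifier, and `json_schema` parameter support are assumptions for illustration, not a verified Inception API reference; check the official docs before relying on any of them.

```python
from openai import OpenAI

# Assumed endpoint and credentials; both are illustrative,
# not confirmed by the announcement.
client = OpenAI(base_url="https://api.inceptionlabs.ai/v1",
                api_key="YOUR_INCEPTION_API_KEY")

resp = client.chat.completions.create(
    model="mercury-2",  # hypothetical model identifier
    messages=[{"role": "user",
               "content": "Summarize diffusion LLMs as JSON."}],
    # Schema-aligned output, per the announced feature; whether
    # Inception uses this exact parameter is an assumption.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "summary",
            "schema": {
                "type": "object",
                "properties": {"summary": {"type": "string"}},
                "required": ["summary"],
            },
        },
    },
)
print(resp.choices[0].message.content)
```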

Key Points
  • Uses a diffusion architecture for parallel refinement, generating >5x faster than autoregressive models at 1,009 tokens/sec on NVIDIA Blackwell GPUs.
  • Priced at $0.25/1M input and $0.75/1M output tokens with 128K context, tool use, and structured JSON output (see the quick cost and latency arithmetic after this list).
  • Designed for latency-compounding production use cases like coding agents, real-time voice, and multi-step workflows where speed defines feasibility.
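
A quick back-of-the-envelope on the listed rates and throughput. The workload numbers (8K input tokens, 500 output tokens, 100 calls) are made up for illustration; only the per-token rates and the 1,009 tokens/sec figure come from the announcement.

```python
IN_RATE = 0.25 / 1_000_000   # dollars per input token
OUT_RATE = 0.75 / 1_000_000  # dollars per output token
TOKS_PER_SEC = 1_009         # claimed Blackwell throughput

def call_cost(n_in: int, n_out: int) -> float:
    """Cost in dollars for one request at the listed rates."""
    return n_in * IN_RATE + n_out * OUT_RATE

# Hypothetical agent loop: 8K tokens in, 500 out, 100 calls.
cost = call_cost(8_000, 500)
print(f"per call: ${cost:.5f}  (100 calls: ${100 * cost:.2f})")
# Decode time for the 500-token reply, ignoring time-to-first-token:
print(f"decode time: ~{500 / TOKS_PER_SEC:.2f}s")
```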

Why It Matters

Enables complex reasoning in real-time applications like voice AI and interactive coding, where latency has traditionally forced quality compromises.