Research & Papers

[P] Gemma 4 running on NVIDIA B200 and AMD MI355X from the same inference stack, 15% throughput gain over vLLM on Blackwell

Gemma 4's new dense 31B and MoE 26B models achieve 15% higher output throughput than vLLM on NVIDIA's B200.

Deep Dive

Google DeepMind has launched Gemma 4, introducing two powerful new open-weight models designed for efficiency and long-context performance. The flagship is Gemma 4 31B, a dense model with a 256,000-token context window. Alongside it is the Gemma 4 26B A4B, a Mixture-of-Experts (MoE) model with 26 billion total parameters but only 4 billion active per forward pass, significantly boosting inference efficiency. Both models are natively multimodal, capable of processing text, images, and video with dynamic resolution, marking a substantial architectural redesign focused on quality and scalability.
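To illustrate the active-parameter idea behind the 26B A4B design, here is a minimal sketch of top-k expert routing in PyTorch. The layer sizes, expert count, and top-k value are toy assumptions chosen for readability, not Gemma 4's actual configuration.

```python
# Minimal sketch of top-k Mixture-of-Experts routing, showing why a model
# with many total parameters runs only a small fraction of them per token.
# All dimensions below are toy values, not Gemma 4's real architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)           # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

x = torch.randn(4, 64)
moe = TopKMoE()
print(moe(x).shape)  # torch.Size([4, 64]); only 2 of 8 expert FFNs ran per token
```

Per token, only the selected expert feed-forward networks execute, so inference compute scales with the active parameter count rather than the total; that is the distinction between the 26B total and 4B active parameters.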

In a significant technical demonstration, the models were deployed on launch day using Modular's MAX inference stack, running across both NVIDIA's new Blackwell B200 and AMD's MI355X accelerators from a single, unified software stack. Initial benchmarks on the NVIDIA B200 showed 15% higher output throughput compared to the widely used vLLM framework. This cross-vendor compatibility and performance gain highlight a move towards hardware-agnostic, optimized inference. Google and Modular have also released a free online playground, allowing developers to test Gemma 4's capabilities without any setup, accelerating experimentation and adoption.
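For readers who want a rough feel for how such a comparison is measured, the sketch below times output-token throughput against an OpenAI-compatible serving endpoint, which both vLLM and Modular's MAX expose. The URL, model id, and prompt are placeholder assumptions, not the benchmark's actual setup.

```python
# Hedged sketch: measure output-token throughput from a single request to an
# OpenAI-compatible endpoint. Endpoint URL and model id are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gemma-4-26b-a4b",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize mixture-of-experts routing."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens  # tokens generated by the server
print(f"{out_tokens} output tokens in {elapsed:.2f}s "
      f"-> {out_tokens / elapsed:.1f} tok/s")
```

A production-grade benchmark would drive many concurrent requests and report aggregate output tokens per second under load, but the single-request version shows the basic measurement behind a throughput claim.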

Key Points
  • Google DeepMind released two Gemma 4 models: a dense 31B model and a 26B MoE model with only 4B active parameters.
  • Both models feature a 256K context window and are natively multimodal for text, image, and video input.
  • Running on a unified stack, they achieved 15% higher output throughput than vLLM on NVIDIA's Blackwell B200 chip.

Why It Matters

This delivers more efficient, hardware-flexible AI inference, lowering costs and complexity for enterprises deploying large multimodal models.