Open Source

Meta announces four new MTIA chips focused on inference

Meta's custom silicon hits 27.6 TB/s memory bandwidth, a 4.5x leap, to tackle the LLM inference bottleneck.

Deep Dive

Meta has detailed four new generations of its custom Meta Training and Inference Accelerator (MTIA) chips, developed in a rapid two-year cycle that produced a new chip iteration roughly every six months. The company is taking a modular, chiplet-based approach, allowing components to be swapped without full redesigns. The latest MTIA 450 and 500 models represent a strategic shift to an 'inference-first' architecture, specifically optimized for running generative AI models, in contrast to Nvidia's training-first GPU design. This focus makes sense given the massive scale at which Meta serves AI products to billions of users.

A key breakthrough is in memory bandwidth, the primary bottleneck for large language model (LLM) inference. Bandwidth has scaled from 6.1 TB/s on the MTIA 300 to a massive 27.6 TB/s on the MTIA 500—a 4.5x increase. Meta claims the MTIA 450 already outperforms leading commercial products in this critical metric. The chips also push heavily on low-precision compute, with the MX4 data type on the MTIA 500 hitting 30 PFLOPS while reportedly preserving model quality. For developers, integration is streamlined with PyTorch-native support, including torch.compile and a vLLM plugin, enabling models to run on both GPUs and MTIA hardware without code rewrites.
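
That PyTorch-level abstraction is what makes the dual-target claim plausible: the model definition stays the same and only the device target changes. The following is a minimal sketch of that idea, not Meta's actual stack; the "mtia" device string and its availability outside Meta's environment are assumptions based on PyTorch's accelerator backend naming.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Stand-in model; the point is that its code is device-agnostic."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def run_inference(device: str) -> torch.Tensor:
    model = TinyMLP().to(device).eval()
    # torch.compile lowers the model through whatever backend the target
    # device exposes; the model definition above does not change per device.
    compiled = torch.compile(model)
    x = torch.randn(8, 1024, device=device)
    with torch.no_grad():
        return compiled(x)

if __name__ == "__main__":
    # "cuda" on an Nvidia GPU host; "mtia" (assumed device name) on an MTIA host.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(run_inference(device).shape)
```

In the same spirit, the vLLM plugin Meta describes would let a serving stack keep its model and scheduling code while the execution backend is swapped underneath.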

The MTIA 400 is currently heading to Meta's data centers, with the more advanced 450 and 500 versions slated for production in 2027. This aggressive roadmap underscores Meta's commitment to reducing its reliance on external silicon vendors like Nvidia and controlling its own AI infrastructure destiny. By building hardware tailored to its specific inference workloads and software stack, Meta aims to achieve greater efficiency and cost savings at the unprecedented scale required for its family of apps and AI services.

Key Points
  • Inference-first design: MTIA 450/500 are optimized specifically for running GenAI models, unlike Nvidia's training-first GPUs.
  • Massive bandwidth leap: Memory bandwidth scaled 4.5x, from 6.1 TB/s (MTIA 300) to 27.6 TB/s (MTIA 500), targeting the LLM inference bottleneck (see the rough calculation after this list).
  • PyTorch-native integration: Full support for torch.compile and a vLLM plugin allows models to run on GPUs or MTIA without code changes.
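
To put the bandwidth figures above in perspective, here is a back-of-the-envelope sketch of why memory bandwidth caps LLM decode speed: each generated token has to stream the model's weights from memory, so peak tokens per second is roughly bandwidth divided by the weight footprint. The 70B-parameter, 8-bit-weight model below is an illustrative assumption, not a Meta-published configuration.

```python
# Rough decode ceiling: tokens/sec <= bandwidth / bytes of weights read per token.
# Model size and precision are illustrative assumptions, not Meta figures.

def decode_ceiling_tokens_per_sec(bandwidth_tb_s: float,
                                  params_billions: float,
                                  bytes_per_param: float) -> float:
    bytes_per_token = params_billions * 1e9 * bytes_per_param  # weights streamed once per token
    return bandwidth_tb_s * 1e12 / bytes_per_token

# Hypothetical 70B-parameter model stored in 8-bit weights (~70 GB resident).
for name, bw in [("MTIA 300", 6.1), ("MTIA 500", 27.6)]:
    ceiling = decode_ceiling_tokens_per_sec(bw, params_billions=70, bytes_per_param=1)
    print(f"{name}: ~{ceiling:.0f} tokens/s per decode stream, upper bound")
```

The absolute numbers are not meaningful, but the ratio is: for bandwidth-bound decoding, the 4.5x jump in memory bandwidth translates roughly one-for-one into a 4.5x higher throughput ceiling.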

Why It Matters

This move reduces Meta's dependence on Nvidia, lowers AI inference costs at scale, and could reshape the data center chip market.