Developer Tools

Apple's MLX Delegate for ExecuTorch Delivers 3-6x Faster GPU Inference on Macs

The ExecuTorch MLX Delegate runs generative AI models 3-6x faster on Apple Silicon Mac GPUs than existing ExecuTorch backends.

Deep Dive

The new MLX Delegate for ExecuTorch brings GPU-accelerated inference to PyTorch models on Apple Silicon Macs. Previously, ExecuTorch users on macOS relied on backends such as the CPU-based XNNPACK or the AOTI Metal backend. The delegate compiles models through the standard ExecuTorch pipeline: it partitions the graph and dispatches the supported operations to MLX's optimized Metal GPU kernels. Early benchmarks show 3-6x higher throughput on generative AI workloads compared to existing ExecuTorch backends. It currently supports around 90 ATen ops that cover transformer inference needs, including quantized matmul, multi-head attention, rotary position embeddings, mixture-of-experts routing, and recurrent state-space operations.
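
To make the partitioning step concrete, here is a minimal toy sketch in plain Python. It is not the ExecuTorch or MLX API: the op names, the SUPPORTED_OPS set, and the Segment type are hypothetical, and a real exported graph is a DAG rather than a flat list. The idea it illustrates is that consecutive supported ops are grouped into segments handed to the delegate, while unsupported ops stay on a fallback path.

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical support set for illustration; the real delegate covers ~90 ATen ops.
SUPPORTED_OPS: Set[str] = {"aten.mm", "aten.add", "aten.softmax", "aten.mul"}


@dataclass
class Segment:
    backend: str    # "mlx" for delegated segments, "fallback" for everything else
    ops: List[str]  # ops in execution order


def partition(ops: List[str], supported: Set[str]) -> List[Segment]:
    """Group consecutive ops by whether the delegate supports them."""
    segments: List[Segment] = []
    for op in ops:
        backend = "mlx" if op in supported else "fallback"
        if segments and segments[-1].backend == backend:
            # Same backend as the previous op: extend the current segment.
            segments[-1].ops.append(op)
        else:
            # Backend changed: start a new segment.
            segments.append(Segment(backend, [op]))
    return segments


# A flattened toy "graph"; the unsupported op splits it into three segments.
graph = ["aten.mm", "aten.add", "custom.unsupported_op", "aten.softmax", "aten.mul"]
segments = partition(graph, SUPPORTED_OPS)
for seg in segments:
    print(seg.backend, seg.ops)
```

In this simplified picture, fewer and larger delegated segments mean fewer hand-offs between the GPU path and the fallback path, which is one reason broad op coverage matters for a delegate.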

The delegate integrates directly with the PyTorch 2 export stack, using torch.export for graph capture and TorchAO for quantization, so it fits into standard ExecuTorch workflows. Supported models include dense transformers (Llama 3.2, Qwen 3, Phi-4 mini, Gemma 3), sparse Mixture-of-Experts (Qwen 3.5 35B-A3B), and speech-to-text models (Whisper, Voxtral, Parakeet) for both offline and real-time transcription. Supported precisions and quantization formats span BF16, FP16, and FP32 floating point, 2/4/8-bit affine quantization, and NVFP4. The delegate is currently experimental and under active development, and its APIs are subject to change.
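
As a rough illustration of what "2/4/8-bit affine" means, the sketch below quantizes one small group of weights with a per-group scale and bias and then reconstructs them. It shows the general technique only, not the TorchAO or MLX implementation: the function names, the min/max calibration, and the sample values are assumptions made for this example.

```python
from typing import List, Tuple


def quantize_affine(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 2**bits - 1] using a scale and bias."""
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo would give a zero scale.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo  # lo serves as the bias (offset) for dequantization


def dequantize_affine(codes: List[int], scale: float, bias: float) -> List[float]:
    """Reconstruct approximate floats: value ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


# One hypothetical weight group; real schemes quantize many such groups per tensor.
group = [-1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
results = {}
for bits in (2, 4, 8):
    codes, scale, bias = quantize_affine(group, bits)
    restored = dequantize_affine(codes, scale, bias)
    max_err = max(abs(a - b) for a, b in zip(group, restored))
    results[bits] = (codes, scale, max_err)
    print(f"{bits}-bit: codes={codes} scale={scale:.4f} max_err={max_err:.4f}")
```

The reconstruction error is bounded by half the scale, and the scale shrinks as the bit width grows, which is the accuracy-versus-size trade-off behind offering several bit widths.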

Key Points
  • 3-6x higher throughput on generative AI workloads than existing ExecuTorch backends on Apple Silicon Macs, according to early benchmarks.
  • Supports ~90 ATen ops including MoE routing, sliding window attention, and state-space operations.
  • Integrates with PyTorch 2 export and TorchAO, covering multiple precisions and quantization formats (BF16, FP16, FP32, 2/4/8-bit affine, NVFP4).

Why It Matters

The delegate enables fast, on-device GPU inference for PyTorch models on Macs, which matters for local AI apps and for keeping user data private.