Research & Papers

[P] MNIST from scratch in Metal (C++)

A hand-coded MNIST neural network, written from scratch with Metal-cpp, outperforms Apple's own MLX framework on small-batch training.

Deep Dive

A developer's deep dive into Apple's Metal API has yielded surprising performance results. Abe Leininger built a complete 2-layer multi-layer perceptron (MLP) for the classic MNIST dataset from the ground up using only Metal-cpp, Apple's C++ wrapper for the Metal API, bypassing high-level frameworks like PyTorch. The goal was to understand the real mechanics of GPU programming on Apple Silicon, focusing on API-level details that high-level frameworks abstract away. The project's final iteration outperformed Apple's own MLX machine learning library on small-batch training, a notable achievement for a custom implementation.

The development process went through three key versions: v1 with naive Metal kernels, v2 adding forward/backward passes with better buffer management, and v3 consolidating each training step into a single command buffer per batch to minimize costly CPU/GPU synchronization. This last optimization, which syncs with the GPU only once per epoch to read back the loss, proved crucial. The exercise highlights how architectural choices (buffer residency, command buffer structure, and synchronization points) often impact performance more than raw kernel optimization does. It serves as a valuable case study for developers looking to squeeze maximum performance from Apple's GPUs, and underscores how well-crafted, low-level code can beat general-purpose frameworks on specific tasks.

Key Points
  • Custom Metal-cpp implementation of a 2-layer MLP for MNIST trained from scratch.
  • Final version (v3), using a single command buffer per batch, beat Apple's MLX framework on small-batch training speed.
  • Project focused on low-level Metal API concepts: buffer residency, command buffers, and CPU/GPU synchronization.

Why It Matters

Shows the tangible performance benefits of mastering low-level GPU APIs, offering a blueprint for optimizing AI on Apple Silicon.