Reverse-Engineered Apple Neural Engine (ANE) to Train MicroGPT
Bypassed Apple's CoreML to unlock 38 TFLOPS of private NPU compute, achieving 6.6 TFLOPS/watt efficiency.
A developer has reverse-engineered Apple's proprietary Neural Engine (ANE) to train a 110M-parameter MicroGPT model, bypassing Apple's recommended CoreML framework to access the NPU's compute directly. Using Claude to decode private APIs, the project unlocks the M4 chip's Neural Engine—a component Apple treats as a "black box"—revealing its claimed 38 TFLOPS of INT8 performance (actual FP16 compute is roughly half that). The work was driven by a desire to leverage the Mac Mini M4's specialized AI hardware beyond standard GPU-based Metal training, opening a new frontier for on-device model development.
The technical achievement lies in creating a bespoke training pipeline that taps into the ANE's extreme power efficiency: peak compute consumes just 2.8W, yielding an unprecedented 6.6 TFLOPS per watt. This efficiency dramatically outpaces Apple's Metal GPU (~1 TFLOPS/W) and even Nvidia's H100 (1.4 TFLOPS/W). While a single ANE can't train massive models, it's capable of LoRA fine-tuning for 3B or 7B parameter models. The work, shared on GitHub, suggests future potential for clusters of Apple Silicon devices to form highly efficient, low-power training environments, challenging the assumption that serious AI training requires server-grade GPUs.
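The efficiency claims above are simple arithmetic on the quoted figures. As a sanity check (using the article's rounded numbers, not independent measurements), the ratios work out roughly as stated:

```python
# Sanity-check the efficiency figures quoted in the article.
# These are the article's numbers, not measurements of our own.

ane_tflops_fp16 = 19.0  # ~half of the 38 TFLOPS INT8 claim
ane_watts = 2.8         # reported peak power draw of the ANE

eff_ane = ane_tflops_fp16 / ane_watts  # ~6.8 TFLOPS/W; the article quotes 6.6,
                                       # the small gap is rounding of the inputs

# Comparison points given in the article:
eff_metal = 1.0  # Apple Metal GPU, ~1 TFLOPS/W
eff_h100 = 1.4   # Nvidia H100, ~1.4 TFLOPS/W

print(f"ANE: {eff_ane:.1f} TFLOPS/W "
      f"(~{eff_ane / eff_metal:.0f}x Metal, ~{eff_ane / eff_h100:.1f}x H100)")
```

Even allowing for rounding, the ANE comes out several times more efficient per watt than either comparison point, which is the core of the article's argument.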
- Reverse-engineered Apple's private Neural Engine APIs using Claude, bypassing the CoreML framework to access raw NPU compute.
- Achieved 6.6 TFLOPS/watt efficiency (2.8W power draw for ~19 TFLOPS FP16), vastly outperforming the Metal GPU and the H100 on a per-watt basis.
- Created a custom pipeline to train a 110M MicroGPT model, enabling LoRA fine-tuning for 3B/7B models on a single device.
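The reason LoRA fine-tuning of 3B/7B models is feasible on such modest hardware is that LoRA freezes the pretrained weights and trains only two small low-rank matrices per layer. A minimal NumPy sketch of one LoRA linear layer (the sizes, scaling factor, and initialization here are illustrative assumptions, not details from the project):

```python
import numpy as np

# Minimal LoRA (Low-Rank Adaptation) linear layer sketch.
# W is the frozen pretrained weight; only A and B are trained,
# so the trainable parameter count shrinks dramatically.

rng = np.random.default_rng(0)
d_in, d_out, rank = 512, 512, 8  # illustrative sizes, not from the project

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.02   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection (init 0)
alpha = 16.0                                   # LoRA scaling hyperparameter

def lora_forward(x):
    # y = x W^T + (alpha / rank) * (x A^T) B^T
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)

full_params = W.size           # 262,144 weights in the frozen layer
lora_params = A.size + B.size  # 8,192 trainable weights (~3%)
print(y.shape, full_params, lora_params)
```

Because B starts at zero, the adapted layer initially matches the frozen model exactly, and only the ~3% of parameters in A and B need gradients and optimizer state during fine-tuning.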
Why It Matters
Unlocks ultra-efficient, on-device AI training on consumer Apple hardware, potentially reducing costs and energy use for model development.