Research & Papers

[P] Bypassing CoreML to natively train a 110M Transformer on the Apple Neural Engine (Orion)

Open-source system unlocks Apple's 38 TOPS ANE for on-device LLM training, hitting 170+ tokens/s inference on an M4 Max.

Deep Dive

A developer has open-sourced ORION, a system that bypasses Apple's restrictive CoreML framework to enable direct programming of the Apple Neural Engine (ANE) and on-device training of language models. Building on foundational hardware reverse-engineering by 'maderix', the work addresses the core frustration with Apple's ML stack: CoreML's opaque abstractions prevent direct ANE access and do not support training, leaving the chip's significant compute, up to 38 TOPS (INT8) and roughly 19 TFLOPS (fp16), largely untapped for LLMs. ORION bridges the gap from raw hardware exploit to a numerically stable runtime, and was developed through an 'architectural delegation' approach in which Claude AI generated the low-level code.

The technical achievement is substantial: it required cataloging 17 undocumented ANE programming constraints (11 of them newly discovered) and breaking through the 'numerical stability ceiling' that caused previous attempts to diverge with 100% NaN rates. ORION's custom compiler lowers a 27-operation graph IR through five optimization passes to emit ANE-native MIL. Key fixes included a deferred compilation pipeline to prevent stale programs, activation clamping to contain fp16 overflow, and strict gradient sanitization. The result is stable multi-step training of a 110M-parameter model and 170+ tokens/s inference for GPT-2 124M on an M4 Max, demonstrating the leverage gained by programming Apple's specialized silicon directly.
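To make the clamping-at-the-MIL-level idea concrete, here is a minimal sketch of an fp16-safe activation expressed with Apple's public coremltools MIL builder. This is illustrative only: ORION uses its own compiler rather than coremltools, and the layer shape, the gelu activation, and the clamp bounds are assumptions, not details from the post.

```python
import coremltools as ct
from coremltools.converters.mil import Builder as mb

FP16_MAX = 65504.0  # largest finite fp16 value; anything beyond overflows to Inf

# Hypothetical one-op program: a GELU whose output is clamped into the
# fp16-representable range before it reaches the ANE's fp16 datapath.
@mb.program(input_specs=[mb.TensorSpec(shape=(1, 768))])
def clamped_activation(x):
    y = mb.gelu(x=x)
    # clip is MIL's elementwise clamp: alpha = lower bound, beta = upper bound.
    return mb.clip(x=y, alpha=-FP16_MAX, beta=FP16_MAX)

# Convert to an mlprogram and ask Core ML to schedule it on the Neural Engine.
model = ct.convert(
    clamped_activation,
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```

Baking the clamp into the emitted graph, rather than applying it in a host-side training loop, matches the constraint that the ANE executes a compiled program: any numerical guardrail has to live in the program itself.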

Key Points
  • Bypasses CoreML to enable direct ANE programming and on-device LLM training, unlocking ~19 TFLOPS of fp16 compute.
  • Solves critical stability bugs (NaN divergence, fp16 overflow, corrupted weights) that halted previous ANE training attempts like ANEgpt; a sanitization sketch follows this list.
  • Achieves 170+ tokens/s for GPT-2 124M inference on M4 Max and trains a stable 110M-parameter Transformer model.
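As referenced in the list above, here is a minimal sketch of what 'strict gradient sanitization' could look like, written in plain NumPy. The post does not spell out ORION's exact rules; the scrub-then-clip order, the norm threshold, and the sanitize_gradient name are all assumptions.

```python
import numpy as np

FP16_MAX = 65504.0  # largest finite float16 value

def sanitize_gradient(grad: np.ndarray, max_norm: float = 1.0) -> np.ndarray:
    """Scrub non-finite entries and bound magnitude before an fp16 weight update.

    Hypothetical helper; ORION's actual sanitization rules are not public.
    """
    g = grad.astype(np.float32)
    # Replace NaN/Inf (e.g. from fp16 overflow) with zeros so one bad value
    # cannot poison the whole update and snowball into 100% NaN divergence.
    g = np.nan_to_num(g, nan=0.0, posinf=0.0, neginf=0.0)
    # Global-norm clipping bounds the step size.
    norm = float(np.linalg.norm(g))
    if norm > max_norm:
        g *= max_norm / norm
    # Final clamp guarantees every entry is representable in fp16.
    return np.clip(g, -FP16_MAX, FP16_MAX).astype(np.float16)

# Example: a gradient polluted by overflow becomes a finite, bounded fp16 update.
g = np.array([0.5, np.inf, np.nan, -3.0], dtype=np.float32)
print(sanitize_gradient(g))
```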

Why It Matters

Unlocks Apple's powerful, specialized Neural Engine for on-device AI research and private LLM training, bypassing restrictive vendor software.