I trained a language model on CPU in 1.2 hours with no matrix multiplications — here's what I learned
A 13.6M-parameter model with ternary weights trains on a 2-thread CPU and reaches a 6.80 validation loss.
Deep Dive
Developer changcheng967 built FlashLM v3, a 13.6M-parameter language model. It uses ternary weights ({-1, 0, +1}) and was trained on a 2-thread CPU in just 1.2 hours on 32M tokens. The model learns syntax but not semantics, and profiling revealed that 86% of training time went to the inefficient output layer. The code is MIT-licensed and available on Hugging Face.
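The ternary-weight trick is easiest to see in code: when every weight is -1, 0, or +1, a linear layer needs no multiplications at all, only additions and subtractions. Below is a minimal NumPy sketch of that idea, not the FlashLM v3 implementation; the function names, shapes, and quantization threshold are assumptions for illustration.

```python
# Sketch of a matmul-free ternary linear layer (illustrative, not FlashLM v3 code).
import numpy as np

def quantize_ternary(w: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Round full-precision weights to {-1, 0, +1} using a fixed threshold (assumed)."""
    q = np.zeros_like(w, dtype=np.int8)
    q[w > threshold] = 1
    q[w < -threshold] = -1
    return q

def ternary_linear(x: np.ndarray, w_ternary: np.ndarray) -> np.ndarray:
    """Compute x @ w_ternary.T using only adds and subtracts per output unit."""
    out = np.zeros((x.shape[0], w_ternary.shape[0]), dtype=x.dtype)
    for j, row in enumerate(w_ternary):
        pos = x[:, row == 1].sum(axis=1)   # contributions from +1 weights
        neg = x[:, row == -1].sum(axis=1)  # contributions from -1 weights
        out[:, j] = pos - neg              # zero weights contribute nothing
    return out

# Toy usage with assumed sizes: a vocab-sized output projection like this
# touches every vocabulary entry for every token, which is why such a layer
# can dominate training time even with add-only arithmetic.
rng = np.random.default_rng(0)
hidden, vocab = 256, 8000
w = quantize_ternary(rng.normal(scale=0.1, size=(vocab, hidden)))
h = rng.normal(size=(4, hidden)).astype(np.float32)
logits = ternary_linear(h, w)
print(logits.shape)  # (4, 8000)
```

The toy output projection above also hints at why that layer dominated training: its cost scales with vocabulary size per token, regardless of how cheap each individual operation is.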
Why It Matters
This project highlights where compute actually goes in efficient model designs (here, the output projection) and pushes the boundaries of what's possible on minimal, CPU-only hardware.