Open Source

I built a transformer in C++17 from scratch — no PyTorch, no BLAS, no dependencies. Trains on CPU. 0.83M params, full analytical backprop, 76 min to val loss 1.64.

Deep Dive

Developer Eamon2009 (Reddit user Suspicious_Gap1121) released Quadtrix.cpp, a fully functional GPT-style transformer written entirely in C++17 with zero external dependencies: no PyTorch, no LibTorch, no BLAS, and no automatic-differentiation libraries. The only dependencies are the C++17 standard library and POSIX sockets. The model has 4 layers, 4 attention heads, and a 200-dimensional embedding space, for a total of 826,985 parameters (0.83M). It uses a context window of 128 characters and was trained on a 31.4M-character corpus of children's stories for 76.2 minutes on a single CPU core, reaching a validation loss of 1.6371 nats.

The implementation includes a lightweight CPU float tensor library (2D/3D, row-major), token and position embeddings, LayerNorm, multi-head causal self-attention, feed-forward blocks, and a complete analytical backward pass with explicit gradient derivations for every operator. The AdamW optimizer, including bias correction, is also hand-written. The developer notes that the LayerNorm backward pass required the full three-term formula and caching of intermediate values, while the attention backward needed careful handling of dropout masks. OpenMP parallelization across all CPU cores yields a 5-7x speedup on an 8-core machine.
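
The LayerNorm backward is the step the developer singles out as needing the full three-term formula and cached intermediates. Below is a minimal single-row sketch of that standard derivation, assuming the forward pass cached the normalized input x_hat and inv_std = 1/sqrt(var + eps); the function name and signature are illustrative, not the repo's actual API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical single-row LayerNorm backward illustrating the three-term formula.
// Assumes the forward pass cached x_hat = (x - mean) / sqrt(var + eps) and
// inv_std = 1 / sqrt(var + eps). Parameter gradients accumulate across rows.
void layernorm_backward_row(const std::vector<float>& dy,     // upstream gradient
                            const std::vector<float>& x_hat,  // cached normalized input
                            const std::vector<float>& gamma,  // scale parameter
                            float inv_std,                    // cached 1/sqrt(var+eps)
                            std::vector<float>& dx,
                            std::vector<float>& dgamma,
                            std::vector<float>& dbeta) {
    const float n = static_cast<float>(dy.size());
    float sum_dxhat = 0.0f, sum_dxhat_xhat = 0.0f;
    for (std::size_t i = 0; i < dy.size(); ++i) {
        const float dxhat = dy[i] * gamma[i];   // gradient w.r.t. x_hat
        sum_dxhat      += dxhat;
        sum_dxhat_xhat += dxhat * x_hat[i];
        dgamma[i] += dy[i] * x_hat[i];
        dbeta[i]  += dy[i];
    }
    // Three terms: the direct path, the mean path, and the variance path.
    for (std::size_t i = 0; i < dy.size(); ++i) {
        const float dxhat = dy[i] * gamma[i];
        dx[i] = inv_std * (dxhat - sum_dxhat / n - x_hat[i] * sum_dxhat_xhat / n);
    }
}
```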

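The hand-written optimizer amounts to a plain AdamW update with bias correction and decoupled weight decay. Here is a self-contained sketch of such a step, with illustrative names and default hyperparameters rather than the repo's own.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch of a hand-rolled AdamW step with bias correction and decoupled
// weight decay, in the spirit of the repo's hand-written optimizer. Names and
// defaults are assumptions, not taken from Quadtrix.cpp itself.
struct AdamW {
    float lr = 1e-3f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f, weight_decay = 0.01f;
    std::vector<float> m, v;  // first and second moment estimates
    long t = 0;               // step counter for bias correction

    void step(std::vector<float>& param, const std::vector<float>& grad) {
        if (m.empty()) { m.assign(param.size(), 0.0f); v.assign(param.size(), 0.0f); }
        ++t;
        const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));
        const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
        for (std::size_t i = 0; i < param.size(); ++i) {
            m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
            v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
            const float m_hat = m[i] / bc1;  // bias-corrected first moment
            const float v_hat = v[i] / bc2;  // bias-corrected second moment
            // Decoupled weight decay: applied directly to the parameter rather
            // than folded into the gradient (the "W" in AdamW).
            param[i] -= lr * (m_hat / (std::sqrt(v_hat) + eps) + weight_decay * param[i]);
        }
    }
};
```
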
The project's output after training is largely gibberish, as the developer acknowledges: it is a 0.83M-parameter model trained for just over an hour on a CPU. The achievement lies in the depth of implementation, however: every gradient is derived and coded by hand, with no reliance on autograd. A separate LibTorch GPU port of the same architecture runs approximately 75x faster on an RTX 3080, but notably the entire 600-line backward.h becomes unnecessary because LibTorch's autograd handles the backward pass automatically. The project is a strong educational resource for anyone wanting to understand transformer internals at the code level, though its practical utility is limited by the tiny model size and short training run. The repository is available on GitHub under the name Quadtrix.cpp.
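
The contrast with the hand-derived backward pass is easy to see in a toy LibTorch snippet: once the forward computation is recorded, a single backward() call produces the gradients that Quadtrix.cpp derives manually. This is a generic illustration, not code from the port.

```cpp
#include <torch/torch.h>
#include <iostream>

// Generic illustration of why backward.h can go away in the LibTorch port:
// the library records the forward graph and computes every gradient itself.
int main() {
    auto x = torch::randn({4, 8}, torch::requires_grad());
    auto w = torch::randn({8, 2}, torch::requires_grad());
    auto loss = torch::matmul(x, w).pow(2).mean();
    loss.backward();                        // autograd replaces the hand-derived pass
    std::cout << w.grad().sizes() << '\n';  // gradients are simply available
    return 0;
}
```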

Key Points
  • 0.83M-parameter GPT-style transformer implemented entirely in C++17 with zero external dependencies — no PyTorch, BLAS, or auto-diff libraries.
  • Trained on a single CPU core in 76 minutes, reaching val loss 1.6371; OpenMP parallelization provides a 5-7x speedup on 8 cores (see the matmul sketch after this list).
  • Includes hand-written tensor library, forward/backward passes with analytical gradients, AdamW optimizer, and dropout — all from scratch.
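
The OpenMP speedup comes from loop-level parallelism over independent output rows. Below is a rough sketch of how a row-major matmul could be parallelized this way; the function signature and layout are assumptions, not the repo's actual tensor API.

```cpp
#include <vector>

// Rough sketch of OpenMP parallelism over independent output rows of a
// row-major matmul, the kind of loop-level parallelization behind the
// reported speedup. Compile with -fopenmp (or equivalent).
void matmul(const std::vector<float>& a,   // rows x inner, row-major
            const std::vector<float>& b,   // inner x cols, row-major
            std::vector<float>& out,       // rows x cols, row-major
            int rows, int inner, int cols) {
    #pragma omp parallel for               // each thread takes a slice of rows
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float acc = 0.0f;
            for (int k = 0; k < inner; ++k)
                acc += a[r * inner + k] * b[k * cols + c];
            out[r * cols + c] = acc;
        }
    }
}
```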

Why It Matters

Demonstrates complete mastery of transformer internals; serves as a rare educational resource for low-level ML engineering.