BASIS: Balanced Activation Sketching with Invariant Scalars for "Ghost Backpropagation"
New 'ghost backpropagation' method reduces activation memory footprint, enabling deeper models without catastrophic variance.
Vladimer Khasia has published a research paper introducing BASIS (Balanced Activation Sketching with Invariant Scalars), a novel algorithm designed to tackle the fundamental memory bottleneck in training large neural networks. Traditional backpropagation requires storing all intermediate activations, creating an O(L * B * N) memory burden that scales with network depth (L), batch/sequence size (B), and feature dimension (N). This has historically limited how deep models can be made and how long their training sequences can be. BASIS proposes a 'ghost backpropagation' approach that decouples this activation memory from the problematic B dimension.
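To make the scaling concrete, here is a back-of-envelope estimate under an assumed configuration; the layer count, token count, hidden size, and fp16 storage below are illustrative assumptions, not figures from the paper:

```python
# Rough activation-memory estimate: standard backprop vs. a rank-R sketch.
# All numbers here are illustrative assumptions, not figures from the paper.
L = 24               # transformer layers
B = 8 * 2048         # batch_size * sequence_length = tokens stored per layer
N = 4096             # hidden (feature) dimension
R = 32               # compression rank used by the sketch
bytes_per_value = 2  # fp16 activations

exact_mem = L * B * N * bytes_per_value   # O(L * B * N)
sketch_mem = L * R * N * bytes_per_value  # O(L * R * N)

print(f"exact backprop : {exact_mem / 1e9:.1f} GB")    # ~3.2 GB
print(f"rank-32 sketch : {sketch_mem / 1e6:.1f} MB")   # ~6.3 MB
```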
The core innovation is that BASIS propagates the exact error signal (dX) through the network, so gradient flow for learning is never corrupted, while computing the weight gradients (dW) from massively compressed, low-rank tensor sketches of the activations. To solve the instability that has plagued previous sketched-gradient methods, Khasia developed two key mechanisms: 'Balanced Hashing', which eliminates off-diagonal collision variance, and 'Invariant Scalars', which deterministically preserve the exact energy norm of the spatial geometry. In theory, this reduces activation memory to O(L * R * N), where R is a small compression rank.
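As a rough illustration of the idea (not the paper's actual construction), the sketch below caches a rank-R random projection of each layer's input instead of the full activation, reconstructs an approximate dW from that sketch on the backward pass, and rescales the sketch so its energy matches the original activation as a stand-in for the 'Invariant Scalars' step; the exact dX is computed from the weights alone, so it needs no stored activations. All function and variable names are hypothetical.

```python
import numpy as np

def forward_linear(X, W, R=32, seed=0):
    """Forward pass of Y = X @ W that caches a rank-R sketch of X, not X itself.

    X : (M, N_in) activations, where M = batch_size * sequence_length.
    We store S @ X (R x N_in) plus the seed needed to regenerate S later,
    so activation memory per layer is O(R * N_in) instead of O(M * N_in).
    """
    M = X.shape[0]
    S = np.random.default_rng(seed).standard_normal((R, M)) / np.sqrt(R)
    SX = S @ X
    # Stand-in for the 'Invariant Scalars' idea: rescale the sketch so its
    # energy (Frobenius norm) exactly matches that of the original activation.
    SX *= np.linalg.norm(X) / (np.linalg.norm(SX) + 1e-12)
    return X @ W, (seed, M, R, SX)

def backward_linear(dY, W, cache):
    """Backward pass: exact dX, sketched dW.

    Exact rules: dX = dY @ W.T and dW = X.T @ dY. Here dW is approximated
    as (S X).T @ (S dY), which is unbiased because E[S.T @ S] = I.
    """
    seed, M, R, SX = cache
    S = np.random.default_rng(seed).standard_normal((R, M)) / np.sqrt(R)
    dX = dY @ W.T          # exact error signal; needs no stored activations
    dW = SX.T @ (S @ dY)   # low-rank estimate of the weight gradient
    return dX, dW
```

A plain Gaussian projection is used here purely for clarity; the paper's 'Balanced Hashing' replaces this naive sketch with a collision-controlled scheme designed to remove the off-diagonal variance that such random projections introduce.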
The empirical validation is compelling. Training a GPT-style architecture for 50,000 steps showed that, with a compression rank of just R=32, BASIS matched and even marginally outperformed exact backpropagation, yielding a validation loss of 6.575 versus 6.616; the method acted as an implicit regularizer. Remarkably, the algorithm remained stable enough for the model to converge smoothly even under extreme compression (R=1), demonstrating the robustness of the estimator. This breakthrough could significantly lower the hardware barriers for training the next generation of massive AI models.
- Reduces activation memory from O(L*B*N) to O(L*R*N), decoupling it from batch/sequence size.
- Uses 'Balanced Hashing' and 'Invariant Scalars' to stabilize sketched gradients, avoiding catastrophic variance.
- Empirically matches/exceeds exact backpropagation performance on a GPT model (6.575 loss at R=32) and works even at R=1.
Why It Matters
Lowers the massive memory cost of training state-of-the-art AI models, enabling deeper networks and longer contexts on existing hardware.