Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration
New hardware design eliminates costly memory rewrites that slow down AI models like BERT and ViT.
A team of researchers has unveiled TrilinearCIM, a hardware architecture designed to accelerate Transformer models like BERT and Vision Transformers (ViT) with substantially higher energy efficiency. The core innovation addresses a major bottleneck: the self-attention mechanism in Transformers produces dynamic, input-dependent data that forces conventional Compute-in-Memory (CIM) accelerators into constant, energy-intensive reprogramming of their non-volatile memory (NVM), degrading performance and wearing down device endurance. TrilinearCIM sidesteps this entirely by building its memory cells around a novel transistor, the Double-Gate Ferroelectric FET (DG-FeFET).
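To see why the data is dynamic, consider the standard scaled dot-product attention formula (from the Transformer literature, not specific to this preprint). The weight matrices are static, but all three attention operands are recomputed from every input, so a CIM array that stores one of them must be reprogrammed on each inference:

```latex
% Scaled dot-product attention (standard formulation).
% W_Q, W_K, W_V are fixed learned weights, but Q, K, V are
% derived from the input X and change on every inference.
Q = X W_Q, \quad K = X W_K, \quad V = X W_V,
\qquad
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```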
The DG-FeFET enables a 'trilinear' multiply-accumulate operation directly in memory, processing the three key operands of attention (Query, Key, Value) without ever rewriting the memory cells during computation. Evaluated on standard AI benchmarks, the architecture running BERT-base outperformed conventional CIM designs on seven of nine GLUE natural language understanding tasks. At the system level, it achieved up to a 46.6% reduction in energy consumption and a 20.4% improvement in latency compared to standard FeFET-based CIM, at the cost of a 37.3% increase in chip area.
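To make the source of the savings concrete, here is a minimal NumPy sketch, assuming a deliberately crude cost model (one NVM write per cell; the function names and write-counting are illustrative, not from the preprint). A conventional CIM pipeline must program K, and then V, into the array for every input, while the trilinear operation consumes all three dynamic operands as live inputs with zero runtime writes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def conventional_cim_attention(Q, K, V):
    """Toy dataflow model: one operand of each matmul must be
    programmed into the NVM array before it can be used, and K, V
    are fresh for every input (hypothetical one-write-per-cell cost)."""
    nvm_writes = K.size                      # program K into the array
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # in-memory matmul 1
    nvm_writes += V.size                     # reprogram the array with V
    return softmax(scores) @ V, nvm_writes   # in-memory matmul 2

def trilinear_cim_attention(Q, K, V):
    """Toy dataflow model of the trilinear idea: the array takes Q, K,
    and V as live inputs, so no runtime NVM writes occur. (The actual
    DG-FeFET circuit is in the preprint; this mimics only the operand
    flow and the write count.)"""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V, 0            # zero reprogramming writes

# One attention head: 128 tokens, 64-dimensional
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
_, writes_conv = conventional_cim_attention(Q, K, V)
_, writes_tri = trilinear_cim_attention(Q, K, V)
print(writes_conv, writes_tri)  # 16384 vs. 0 writes per head, per input
```

The arithmetic in both paths is identical; the contrast is in the write column, since eliminating those per-inference NVM writes is where the reported energy and latency gains, and the reduced device wear, come from.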
The work, detailed in a preprint on arXiv, represents a significant leap in specialized AI hardware. By performing the complete attention computation exclusively within NVM cores and eliminating dynamic reprogramming, TrilinearCIM solves a fundamental efficiency problem. This paves the way for more powerful and sustainable AI chips capable of running the large language models and vision transformers that dominate current AI, directly on devices where energy and speed are critical constraints.
- Uses novel Double-Gate FeFET transistors to enable in-memory computation of Transformer attention without runtime NVM reprogramming.
- Achieves up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM accelerators.
- Outperforms existing designs on 7 of 9 GLUE tasks with BERT-base, validating its effectiveness for real AI workloads.
Why It Matters
Enables faster, more energy-efficient AI chips for next-generation devices, cutting the heavy computational cost of running large Transformer models like GPT and Claude.