TREA edge AI chip cuts latency 9x with dual-precision SIMD

arXiv eess.IV May 11, 2026

⚡ New 4/8-bit accelerator runs convolution kernels up to 9x faster on edge devices.

Deep Dive

TREA, a new edge-AI accelerator from a research team led by Vijay Pratap Sharma, targets the tight area, power, and latency budgets of edge vision systems. Its core innovation is the DQ-MAC unit, which performs most-significant-digit-first (MSDF) shift-and-add computation with run-time bit truncation, eliminating conventional multiplier overhead and reducing accumulator bit-width. Each unit performs either four FxP4 operations or one FxP8 operation per cycle, giving up to 4x higher throughput without hardware duplication. The SHARP pruning strategy, co-designed with the SIMD datapath, enables nearly 50% structured sparsity while keeping the MAC units fully utilized. As a result, a 3x3 convolution kernel is computed in 1 cycle in FxP4 mode (vs. 9 cycles in FxP8) and a 5x5 kernel in 3 cycles (vs. 25 cycles), a latency reduction of up to 9x at the kernel level. The accelerator also integrates a reconfigurable CORDIC-based nonlinear activation core that supports Sigmoid, Tanh, and ReLU and reuses its hardware through time-multiplexing. The full architecture is a 1D array of 100 SIMD DQ-MAC units with layer-wise hardware reuse, which minimizes area and control complexity.
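The kernel-level cycle counts quoted above can be reproduced with simple arithmetic. The following is a minimal sketch, not the paper's model: it assumes the FxP8 baseline retires one MAC per cycle on a dense kernel, that FxP4 mode retires four MACs per cycle, and that "near 50%" sparsity keeps floor(k*k/2) weights per kernel. The function names and the exact kept-weight count are illustrative assumptions.

```python
from math import ceil


def dense_fxp8_cycles(k: int) -> int:
    """Cycles for a dense k x k kernel at one FxP8 MAC per cycle (assumed baseline)."""
    return k * k


def pruned_fxp4_cycles(k: int, lanes: int = 4) -> int:
    """Cycles for a pruned k x k kernel at `lanes` FxP4 MACs per cycle.

    Assumption: pruning keeps floor(k*k / 2) weights, i.e. roughly 50% sparsity.
    """
    kept = (k * k) // 2
    return ceil(kept / lanes)


for k in (3, 5):
    dense = dense_fxp8_cycles(k)
    pruned = pruned_fxp4_cycles(k)
    print(f"{k}x{k}: {dense} -> {pruned} cycles ({dense / pruned:.1f}x)")
```

Under these assumptions the 3x3 case gives 9 vs. 1 cycles (9x) and the 5x5 case gives 25 vs. 3 cycles (about 8.3x), which matches the "up to 9x" wording.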

Experimental results show improvements in latency, hardware utilization, and energy efficiency over conventional fixed-precision and non-reconfigurable accelerators. TREA is designed for real-time edge vision workloads such as object detection and classification, with target applications including autonomous drones, smart cameras, and industrial IoT. By combining low-precision computation with structured sparsity, it aims to ease the trade-off between accuracy and efficiency. The paper is currently under review at TVLSI and presents a promising direction for deploying deep neural networks on resource-constrained edge devices, potentially enabling high-performance AI inference without relying on cloud connectivity.

Key Points

Dual-precision (4/8-bit) SIMD MAC unit with MSDF shift-and-add eliminates multiplier overhead, achieving up to 4x throughput without hardware duplication.
SHARP pruning yields ~50% structured sparsity; a 3x3 kernel runs in 1 cycle (FxP4) vs. 9 cycles (FxP8), a latency reduction of up to 9x.
100-unit 1D SIMD array with CORDIC-based reconfigurable activation (Sigmoid, Tanh, ReLU) minimizes area and control complexity for edge vision.
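To illustrate how shift-and-add can stand in for a hardware multiplier, here is a minimal sketch of most-significant-bit-first multiplication with optional truncation. It is a simplification under stated assumptions: it uses plain unsigned binary rather than a signed-digit MSDF representation, and the function name and the `keep` parameter are hypothetical, not taken from the DQ-MAC design.

```python
from typing import Optional


def msb_first_multiply(a: int, b: int, bits: int = 8, keep: Optional[int] = None) -> int:
    """Multiply unsigned `a` by unsigned `bits`-wide `b`, scanning b from its MSB.

    `keep` is the number of leading bits of b to process; the remaining
    low-order bits are treated as zero (a stand-in for run-time truncation).
    """
    if keep is None:
        keep = bits
    if not 0 <= keep <= bits:
        raise ValueError("keep must be between 0 and bits")
    if not 0 <= b < (1 << bits):
        raise ValueError("b does not fit in the given bit width")

    acc = 0
    for i in range(bits - 1, bits - 1 - keep, -1):
        acc <<= 1              # shift the running sum one position left
        if (b >> i) & 1:       # add the multiplicand when the current bit is 1
            acc += a
    # Restore the scale of the skipped low-order bits.
    return acc << (bits - keep)


print(msb_first_multiply(13, 11))          # full precision: 13 * 11
print(msb_first_multiply(13, 11, keep=6))  # two low bits of 11 dropped: 13 * 8
```

Because the most significant bits are consumed first, stopping early drops only the least significant contribution, which is why truncation fits naturally with this ordering.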

Why It Matters

Brings real-time AI inference to power-starved edge devices, enabling autonomous drones and smart cameras without cloud lag.
