DHFP-PE: Dual-Precision Hybrid Floating Point Processing Element for AI Acceleration

A new chip architecture achieves up to 86.6% power savings and 60.4% area reduction for edge AI workloads.

Deep Dive

A team of researchers has unveiled the DHFP-PE (Dual-Precision Hybrid Floating Point Processing Element), a novel chip architecture designed to drastically improve the efficiency of AI computations, particularly for edge and mobile devices. The core innovation is a fully pipelined multiply-accumulate (MAC) unit that natively supports the low-precision FP8 (E4M3, E5M2) and FP4 (E2M1, E1M2) formats increasingly used to shrink AI models. Its standout feature is a bit-partitioning technique: a single 4-bit multiplier functions either as one standard 4x4 unit for FP8 operands or as two parallel 2x2 multipliers for the 2-bit FP4 operands, achieving 100% hardware utilization without duplicating logic.
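The partitioning idea can be modeled in software: build the multiplier's partial-product array, and in dual mode simply gate the partial products that mix the low and high 2-bit halves, so the two halves compute independent 2x2 products. This is an illustrative sketch (the function and packing convention are ours, not the paper's RTL):

```python
def pp_array_multiply(a, b, dual=False):
    """Model a 4x4 partial-product multiplier that can split into
    two independent 2x2 multipliers by gating cross-half terms.
    Illustrative model only; not the DHFP-PE's actual circuit."""
    assert 0 <= a < 16 and 0 <= b < 16
    result = 0
    for i in range(4):          # bits of operand a
        for j in range(4):      # bits of operand b
            # Dual mode: drop partial products that pair the low
            # half (bits 0-1) with the high half (bits 2-3).
            if dual and (i < 2) != (j < 2):
                continue
            result += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return result

# Single 4x4 mode behaves as an ordinary multiplier:
assert pp_array_multiply(13, 11) == 13 * 11

# Dual mode: pack two 2-bit operands per input as (hi << 2) | lo;
# the two 4-bit products come back packed as (hi_prod << 4) | lo_prod.
a = (3 << 2) | 2   # a_hi=3, a_lo=2
b = (2 << 2) | 3   # b_hi=2, b_lo=3
p = pp_array_multiply(a, b, dual=True)
assert (p >> 4) == 3 * 2 and (p & 0xF) == 2 * 3
```

The gating works because a 2x2 product fits in 4 bits, so the low product occupies bits 0-3 and the high product (shifted by 4) bits 4-7, with no overlap once the cross terms are suppressed.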

Fabricated using a 28nm process, the DHFP-PE delivers impressive physical results: an operating frequency of 1.94 GHz within a tiny area of just 0.00396 square millimeters and a remarkably low power draw of 2.13 milliwatts. The researchers report that this design achieves up to a 60.4% reduction in silicon area and an 86.6% saving in power consumption compared to current state-of-the-art designs. This level of efficiency is a breakthrough for deploying complex AI models in battery-powered environments like smartphones, drones, and IoT sensors, where every milliwatt and square millimeter counts.
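The reported figures imply striking per-element efficiency. A back-of-envelope check, assuming the fully pipelined unit retires one MAC per cycle and counting a MAC as two operations (the paper's own accounting may differ):

```python
# Reported silicon figures for one DHFP-PE
freq_hz = 1.94e9      # 1.94 GHz clock
power_w = 2.13e-3     # 2.13 mW
ops_per_mac = 2       # one multiply + one accumulate

# Assumption: one FP8 MAC per cycle at full pipeline occupancy
gops = freq_hz * ops_per_mac / 1e9              # ~3.88 GOPS
tops_per_w = freq_hz * ops_per_mac / power_w / 1e12  # ~1.82 TOPS/W

# The bit-partitioning scheme doubles MACs per cycle in FP4 mode,
# which would double both figures.
print(f"{gops:.2f} GOPS, {tops_per_w:.2f} TOPS/W (FP8 mode)")
```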

The paper, accepted for the ANRF-sponsored NEleX-2026 conference, positions the DHFP-PE as a foundational hardware building block for the next generation of AI acceleration. By optimizing for the flexible use of ultra-low-precision formats like FP4 and FP8, the architecture directly addresses the hardware bottleneck for running quantized models. This work bridges the gap between algorithmic advances in model compression and the physical silicon needed to execute them efficiently, paving the way for more capable and ubiquitous on-device AI.
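For readers unfamiliar with these formats, an ExMy label names the bit budget: 1 sign bit, x exponent bits, y mantissa bits. A minimal decoder sketch, assuming the conventional bias of 2^(x-1) - 1 and IEEE-style subnormals (real FP8/FP4 encodings differ in special-value handling, e.g. E4M3 commonly gives up Inf for extra range, and the paper's exact E1M2 convention is not spelled out here):

```python
def decode_minifloat(bits, exp_bits, man_bits):
    """Decode a small-float bit pattern (sign | exponent | mantissa).
    Sketch with the standard bias 2**(exp_bits-1) - 1; actual FP8/FP4
    specs vary in NaN/Inf and, for E1M2, bias conventions."""
    total = 1 + exp_bits + man_bits
    sign = (bits >> (total - 1)) & 1
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:  # subnormal: no implicit leading 1
        val = man / (1 << man_bits) * 2.0 ** (1 - bias)
    else:         # normal: implicit leading 1
        val = (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
    return -val if sign else val

# E4M3 (FP8): 0b0_0111_000 decodes to 1.0
assert decode_minifloat(0b00111000, exp_bits=4, man_bits=3) == 1.0
# E2M1 (FP4): 0b0_01_1 decodes to 1.5
assert decode_minifloat(0b0011, exp_bits=2, man_bits=1) == 1.5
```

The key point for the hardware is the significand width: E4M3 carries a 3-bit mantissa (4-bit significand with the hidden bit), while E2M1 carries a 1-bit mantissa (2-bit significand), which is what lets one 4x4 multiplier double as two 2x2 units.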

Key Points
  • Novel bit-partitioning allows a 4-bit multiplier to operate as 4x4 for FP8 or as two 2x2 for FP4, achieving 100% hardware utilization.
  • Fabricated in 28nm, the chip runs at 1.94 GHz, uses 0.00396 mm² area, and consumes only 2.13 mW of power.
  • Achieves up to 60.4% area reduction and 86.6% power savings versus current designs, enabling complex AI on extreme edge devices.

Why It Matters

This hardware breakthrough could make powerful AI models feasible on smartphones and sensors by drastically cutting power and space needs.