Research & Papers

MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs

New compiler slashes DNN latency by up to 35% on complex edge hardware, unlocking new AI applications.

Deep Dive

A research team from ETH Zurich and the University of Bologna has unveiled MATCHA, a novel framework designed to tackle the complex challenge of running deep neural networks (DNNs) on modern edge computing chips. These systems-on-chip (SoCs) increasingly pack multiple specialized acceleration engines (such as NPUs, GPUs, and DSPs) to handle AI workloads, but existing software often fails to utilize them all efficiently. MATCHA addresses this by generating highly concurrent execution schedules, using constraint programming to jointly optimize L3 and L2 memory allocation and task scheduling across the different hardware units. This approach, which combines pattern matching, tiling, and intelligent mapping, aims to maximize parallel execution and keep every accelerator busy.
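To make the scheduling idea concrete, here is a minimal sketch of this kind of constraint-programming formulation, written with Google's OR-Tools CP-SAT solver (a stand-in; the paper does not specify a solver here). The four-layer network, its cycle counts, and the two-accelerator setup are hypothetical, and the model captures only mapping, no-overlap, and dependency constraints, not MATCHA's memory allocation.

```python
from ortools.sat.python import cp_model

# Hypothetical layer tasks: (name, cycles on NPU, cycles on GPU).
TASKS = [("stem", 8, 12), ("branch_a", 10, 9), ("branch_b", 6, 14), ("head", 4, 5)]
# Dependencies: both branches consume the stem's output; the head joins them.
DEPS = [(0, 1), (0, 2), (1, 3), (2, 3)]

model = cp_model.CpModel()
horizon = sum(max(n, g) for _, n, g in TASKS)

starts, ends, per_acc = [], [], {"npu": [], "gpu": []}
for name, d_npu, d_gpu in TASKS:
    start = model.NewIntVar(0, horizon, f"start_{name}")
    end = model.NewIntVar(0, horizon, f"end_{name}")
    dur = model.NewIntVar(min(d_npu, d_gpu), max(d_npu, d_gpu), f"dur_{name}")
    on_npu = model.NewBoolVar(f"{name}_on_npu")
    # Duration depends on which accelerator the task is mapped to.
    model.Add(dur == d_npu).OnlyEnforceIf(on_npu)
    model.Add(dur == d_gpu).OnlyEnforceIf(on_npu.Not())
    # Optional intervals: the task occupies exactly one accelerator's timeline.
    per_acc["npu"].append(
        model.NewOptionalIntervalVar(start, dur, end, on_npu, f"npu_{name}"))
    per_acc["gpu"].append(
        model.NewOptionalIntervalVar(start, dur, end, on_npu.Not(), f"gpu_{name}"))
    starts.append(start)
    ends.append(end)

for ivs in per_acc.values():
    model.AddNoOverlap(ivs)          # one task at a time per accelerator
for a, b in DEPS:
    model.Add(starts[b] >= ends[a])  # respect the layer dependency graph

makespan = model.NewIntVar(0, horizon, "makespan")
model.AddMaxEquality(makespan, ends)
model.Minimize(makespan)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("end-to-end latency (cycles):", solver.Value(makespan))
```

Because the two branches have no mutual dependency, the solver can place one on each accelerator so they run concurrently, which is exactly the utilization effect described above.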

In practical tests, MATCHA delivered significant performance gains. Evaluated on the industry-standard MLPerf Tiny benchmark suite using an SoC with two heterogeneous accelerators, the framework improved overall accelerator utilization and cut inference latency by up to 35% compared to its predecessor, the MATCH compiler. This leap in efficiency is critical for real-time edge applications, from autonomous drones making split-second decisions to smart sensors processing video feeds on-device. The work, accepted at the prestigious ACM/IEEE Design Automation Conference (DAC26), represents a major step toward fully harnessing the heterogeneous compute power now available at the edge, moving beyond theoretical peak performance to real-world, optimized deployment.

Key Points
  • Cuts inference latency by up to 35% on the MLPerf Tiny benchmark versus the prior MATCH compiler.
  • Uses constraint programming to optimize memory allocation and scheduling across multiple heterogeneous accelerators on a single chip (a toy tiling sketch follows this list).
  • Enables greater parallelism and higher utilization of the heterogeneous hardware (e.g., NPUs, GPUs) common in modern edge SoCs.
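As a complement to the scheduling sketch above, the toy example below shows the flavor of the tiling subproblem: choosing the largest tile of a convolution layer whose working set fits a fixed L2 scratchpad. The 512 KiB capacity, int8 (1-byte) data, and stride-1 valid-convolution memory model are all assumptions for illustration, and the brute-force search stands in for the solver-driven allocation the paper describes.

```python
# Illustrative tiling sketch (not MATCHA's actual algorithm): find the
# largest output-tile height for a conv layer such that the input tile,
# the weights, and the output tile together fit in L2.
L2_BYTES = 512 * 1024  # hypothetical 512 KiB L2 scratchpad


def fits(tile_h, w, c_in, c_out, k, elem=1):
    """Check whether one tile's working set fits the L2 budget (int8 data)."""
    in_tile = (tile_h + k - 1) * w * c_in * elem  # input rows needed (stride 1, valid conv)
    weights = k * k * c_in * c_out * elem
    out_tile = tile_h * w * c_out * elem
    return in_tile + weights + out_tile <= L2_BYTES


def largest_tile_h(h, w, c_in, c_out, k):
    """Brute-force search from the full height down to the largest feasible tile."""
    for tile_h in range(h, 0, -1):
        if fits(tile_h, w, c_in, c_out, k):
            return tile_h
    raise ValueError("layer does not fit in L2 at any tile size")


# Example: a 3x3 conv on a 64x64x64 feature map producing 128 channels.
print(largest_tile_h(h=64, w=64, c_in=64, c_out=128, k=3))  # -> 36
```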

Why It Matters

Unlocks faster, more complex AI directly on devices like phones and robots, reducing cloud dependency and latency.