Research & Papers

FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations

New research shows 2.0x to 8.4x memory reduction for running large AI models on phones.

Deep Dive

Researchers from multiple universities developed FlashMem, a memory streaming framework for mobile GPUs. Instead of preloading all model weights, FlashMem statically schedules weight transfers ahead of time and streams them on demand into 2.5D texture memory, so only the weights needed for the current computation are resident at once. In tests on 11 models, it achieved 1.7x to 75.0x speedups and 2.0x to 8.4x memory reductions, enabling large-scale DNNs and multi-model workloads to run efficiently on resource-constrained mobile devices.
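The memory saving comes from the basic difference between preloading and streaming: with preloading, peak GPU memory is the sum of all layers' weights, while with streaming it is bounded by the largest working set. A minimal sketch of that accounting (hypothetical layer sizes and function names, not FlashMem's actual API):

```python
# Hypothetical illustration of why streaming cuts peak memory
# (toy model, not FlashMem's real implementation).
from typing import List

def peak_memory_preloaded(layer_sizes_mb: List[int]) -> int:
    """All weights loaded up front: peak = total of all layers."""
    return sum(layer_sizes_mb)

def peak_memory_streamed(layer_sizes_mb: List[int]) -> int:
    """Weights streamed layer by layer: each layer is loaded,
    used, then evicted, so peak = the largest single layer."""
    peak = 0
    for size in layer_sizes_mb:
        peak = max(peak, size)
    return peak

layers = [64, 256, 256, 128]            # illustrative per-layer sizes in MB
print(peak_memory_preloaded(layers))    # 704
print(peak_memory_streamed(layers))     # 256
```

In this toy example streaming cuts peak memory from 704 MB to 256 MB, roughly a 2.8x reduction, in the same spirit as the 2.0x to 8.4x reductions reported for FlashMem; a real system would also overlap transfers with compute to hide streaming latency.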

Why It Matters

Enables complex AI applications like multi-model agents and large language models to run locally on smartphones, reducing cloud dependency.