FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations
New research shows 2.0x to 8.4x memory reduction for running large AI models on phones.
Deep Dive
Researchers from multiple universities developed FlashMem, a memory streaming framework for mobile GPUs. Instead of preloading all model weights into GPU memory, it plans weight placement ahead of time (static scheduling) and streams weights on demand during inference through 2.5D texture memory. Across tests on 11 models, it achieved speedups of 1.7x to 75.0x and memory reductions of 2.0x to 8.4x. This lets large-scale DNNs and multi-model workflows run efficiently on resource-constrained mobile devices.
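The memory benefit of streaming over preloading can be illustrated with a toy resident-window model. This is only a sketch of the general idea, not FlashMem's actual scheduler or its 2.5D texture layout; the `Layer` class, function names, and the two-layer window are all illustrative assumptions.

```python
# Toy model of weight streaming: instead of holding every layer's
# weights in GPU memory at once, keep only a small resident window
# (the executing layer plus a prefetched one) and evict older layers.
# All names here are hypothetical, not the paper's API.
from collections import deque
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    weight_bytes: int

def peak_memory_preload(layers):
    """Baseline: all layer weights resident simultaneously."""
    return sum(l.weight_bytes for l in layers)

def peak_memory_streaming(layers, window=2):
    """Streaming: at most `window` layers resident at any time."""
    resident = deque()
    peak = 0
    for layer in layers:
        resident.append(layer)        # stream this layer's weights in
        if len(resident) > window:
            resident.popleft()        # evict the oldest resident layer
        peak = max(peak, sum(l.weight_bytes for l in resident))
    return peak

layers = [Layer(f"layer{i}", 64 << 20) for i in range(12)]  # 12 x 64 MiB
print(peak_memory_preload(layers) // (1 << 20))    # prints 768 (MiB)
print(peak_memory_streaming(layers) // (1 << 20))  # prints 128 (MiB)
```

In this contrived example the resident window cuts peak weight memory by 6x; the real system's gains depend on layer sizes, scheduling, and transfer overlap.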
Why It Matters
Enables complex AI applications like multi-model agents and large language models to run locally on smartphones, reducing cloud dependency.