FlashMem framework speeds up mobile AI by up to 75x by optimizing GPU memory
New research shows 2.0x to 8.4x memory reduction for running large AI models on phones.
Researchers from multiple universities developed FlashMem, a memory streaming framework for mobile GPUs. Instead of preloading all model weights, it statically schedules and dynamically streams them using 2.5D texture memory. In tests on 11 models, it achieved 1.7x to 75.0x speedups and 2.0x to 8.4x memory reduction. This enables large-scale DNNs and multi-model workflows to run efficiently on resource-constrained mobile devices.
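To see why streaming weights can lower peak memory compared with preloading, consider a minimal sketch. It is a toy illustration, not FlashMem's actual scheduler: the layer sizes, the function names, and the fixed `lookahead` window are all assumptions made for this example, and it ignores texture memory layout and transfer latency entirely.

```python
from typing import List


def peak_memory_preload(layer_sizes: List[int]) -> int:
    """Peak weight memory when every layer is resident before inference starts."""
    return sum(layer_sizes)


def peak_memory_streamed(layer_sizes: List[int], lookahead: int = 1) -> int:
    """Peak weight memory when only the executing layer plus the next
    `lookahead` layers are resident (a simple sliding window).

    This is a hypothetical policy chosen for illustration only.
    """
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    if not layer_sizes:
        return 0
    window = lookahead + 1
    # Slide the window across the layers and record the largest footprint.
    return max(
        sum(layer_sizes[i:i + window]) for i in range(len(layer_sizes))
    )


# Hypothetical per-layer weight sizes in MB (not taken from the paper).
sizes_mb = [120, 80, 200, 60, 140]

preload_peak = peak_memory_preload(sizes_mb)
streamed_peak = peak_memory_streamed(sizes_mb, lookahead=1)

print(f"preload peak:  {preload_peak} MB")
print(f"streamed peak: {streamed_peak} MB")
```

With these made-up sizes, preloading holds all 600 MB at once, while the sliding window never holds more than 280 MB. A larger `lookahead` trades memory for more time to hide loading behind computation, which is the kind of trade-off a static schedule would have to settle ahead of time.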
Why It Matters
FlashMem enables complex AI applications, such as multi-model agents and large language models, to run locally on smartphones, reducing cloud dependency.