Audio & Speech

FiPA-SR restores high-frequency audio over 60x faster than AudioSR

A new GAN-based model uses about 3x less GPU memory while outperforming the diffusion-based AudioSR across multiple bandwidths.

Deep Dive

FiPA-SR, developed by Wallace Abreu and Luiz W.P. Biscainho, is a new GAN-based perceptual architecture for audio super-resolution (bandwidth extension). The model builds on the AEROMamba_P framework and adds FiLM (Feature-wise Linear Modulation) layers, which scale and shift intermediate features to condition the reconstruction on the input bandwidth. This lets a single trained model handle multiple input sampling rates (8, 20, and 32 kHz) without a separate model for each. Tested on the MUSDB dataset, FiPA-SR consistently outperforms the state-of-the-art diffusion-based AudioSR model across all tested bandwidths, producing higher-quality high-frequency content.
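To make the FiLM idea concrete, here is a minimal sketch in plain Python. It is not the FiPA-SR implementation: the function name `film`, the table `FILM_PARAMS`, and every number in it are hypothetical, chosen only to show how one set of features can be modulated differently depending on the input sampling rate. In a trained model the per-channel scale (gamma) and shift (beta) would be produced by a small learned network rather than a fixed table.

```python
# Illustrative FiLM (Feature-wise Linear Modulation) sketch, stdlib only.
# Assumption: all names and values below are made up for this example and
# are not taken from the FiPA-SR paper or code.

# One gamma (scale) and beta (shift) per feature channel, keyed by the
# input sampling rate in kHz. A lookup table like this is equivalent to a
# linear layer applied to a one-hot encoding of the condition.
FILM_PARAMS = {
    8:  {"gamma": [1.5, 0.5], "beta": [0.1, -0.1]},
    20: {"gamma": [1.2, 0.8], "beta": [0.0, 0.0]},
    32: {"gamma": [1.0, 1.0], "beta": [0.0, 0.0]},
}


def film(features, sample_rate_khz):
    """Return gamma[c] * x[c][t] + beta[c] for every channel c and step t."""
    params = FILM_PARAMS[sample_rate_khz]
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, params["gamma"], params["beta"])
    ]


# Two channels, three time steps each.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

out_8 = film(features, 8)    # modulated for the hypothetical 8 kHz condition
out_32 = film(features, 32)  # identity modulation in this toy table
```

The same layer weights process every input; only the small gamma and beta vectors change with the condition, which is why one model can serve several input rates.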

Beyond quality, FiPA-SR is far more efficient: it uses approximately 3x less GPU memory than AudioSR and runs inference more than 60x faster. A speedup of that size could move audio super-resolution from offline processing toward near real-time use, even on modest hardware. The paper has been submitted to the XLIV Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2026) and is available on arXiv. Although this is still research work, the practical implication is that FiPA-SR could enable live restoration of low-bandwidth audio (e.g., telephone calls, compressed streams) on edge devices, broadening the reach of high-fidelity audio enhancement.
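Whether a relative speedup translates into real-time operation depends on the baseline's absolute speed. A common measure is the real-time factor (RTF): processing time divided by audio duration, where values below 1 mean the model keeps up with the incoming audio. The sketch below uses a hypothetical baseline timing (30 seconds to process a 10-second clip); that figure is an assumption for illustration and is not reported in the source. Only the 60x speedup comes from the article.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    return processing_seconds / audio_seconds


# Hypothetical baseline: 30 s of compute for a 10 s clip (not a measured value).
baseline_rtf = real_time_factor(30.0, 10.0)

# Apply the 60x inference speedup quoted for FiPA-SR over AudioSR.
speedup = 60.0
fast_rtf = baseline_rtf / speedup

# An RTF below 1 means processing finishes faster than the audio plays.
baseline_is_real_time = baseline_rtf < 1.0
fast_is_real_time = fast_rtf < 1.0
```

Under this assumed baseline, the slower model falls behind the audio while the faster one has headroom to spare. With a different baseline the conclusion could change, so measured timings on the target hardware are what matter.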

Key Points
  • FiPA-SR uses FiLM layers to adapt a single model to the input sampling rate (8, 20, or 32 kHz), outperforming AudioSR at all tested rates.
  • Runs inference more than 60x faster than the diffusion-based AudioSR while using approximately 3x less GPU memory.
  • Built on the AEROMamba_P framework, optimized for perceptual quality via GAN training on the MUSDB dataset.

Why It Matters

Real-time audio super-resolution on limited hardware could restore high-quality voice and music from low-bandwidth recordings.