FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
New technique cuts compute by up to 10x for long video generation...
FreqFormer tackles the core bottleneck in long-sequence video diffusion: the quadratic self-attention cost that dominates runtime and memory for very long token sequences. Unlike prior efficient attention methods that apply one approximation everywhere, FreqFormer exploits the spectral structure of video features: low frequencies carry global layout and coarse motion, while high frequencies carry texture and fine detail. The framework splits token features into spectral bands and applies a different attention operator to each: dense global attention on compressed low-frequency content, structured block-sparse attention on mid frequencies, and sliding-window local attention on high frequencies. A lightweight spectral routing network allocates heads across bands using layer statistics and the diffusion timestep, shifting compute toward global structure early in denoising and toward fine detail later. Cross-band summary tokens provide a cheap residual exchange of information between bands.
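To make the decomposition concrete, here is a minimal PyTorch sketch of the idea, not the paper's implementation: the band fractions, pooling stride, block and window sizes, and single-head attention are all illustrative assumptions, and the block-sparse and sliding-window branches are simplified to fixed non-overlapping chunks.

```python
# Minimal sketch of FreqFormer-style spectral band splitting with a
# different attention operator per band. All names, band fractions, and
# sizes here are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def split_bands(x, low_frac=0.1, mid_frac=0.4):
    """Split (batch, tokens, dim) features into low/mid/high bands along
    the token axis with an orthonormal FFT, so the bands sum back to x."""
    B, N, D = x.shape
    X = torch.fft.rfft(x, dim=1, norm="ortho")        # (B, N//2+1, D) complex
    n_bins = X.shape[1]
    lo_end = max(1, int(low_frac * n_bins))
    mid_end = max(lo_end + 1, int((low_frac + mid_frac) * n_bins))
    bands = []
    for start, end in [(0, lo_end), (lo_end, mid_end), (mid_end, n_bins)]:
        mask = torch.zeros_like(X)
        mask[:, start:end] = 1.0
        bands.append(torch.fft.irfft(X * mask, n=N, dim=1, norm="ortho"))
    return bands  # [low, mid, high], each (B, N, D)

def dense_attn_compressed(x, stride=8):
    """Dense global attention on temporally pooled (compressed) tokens."""
    B, N, D = x.shape  # assumes N is divisible by stride
    xc = F.avg_pool1d(x.transpose(1, 2), stride, stride).transpose(1, 2)
    y = F.scaled_dot_product_attention(xc, xc, xc)    # single head, for brevity
    return F.interpolate(y.transpose(1, 2), size=N, mode="nearest").transpose(1, 2)

def block_attn(x, block=256):
    """Block-sparse attention, simplified to attention within fixed blocks."""
    B, N, D = x.shape  # assumes N is divisible by block
    xb = x.view(B, N // block, block, D)
    return F.scaled_dot_product_attention(xb, xb, xb).reshape(B, N, D)

def local_attn(x, window=64):
    """Sliding-window attention, simplified to non-overlapping windows."""
    return block_attn(x, block=window)

def freqformer_attn(x):
    low, mid, high = split_bands(x)
    return dense_attn_compressed(low) + block_attn(mid) + local_attn(high)

x = torch.randn(2, 4096, 64)      # (batch, tokens, channels)
print(freqformer_attn(x).shape)   # torch.Size([2, 4096, 64])
```

Because the three masks partition the spectrum under an orthonormal FFT, the bands sum exactly back to the input, which matches the orthonormal-decomposition view the paper takes of its approximation.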
FreqFormer is paired with a fused GPU execution plan that co-schedules the dense, sparse, and local branches to cut kernel launches and memory traffic. The paper provides a consistent complexity model, an orthonormal-decomposition view of the approximation, and simulation-based systems estimates covering throughput, arithmetic intensity, memory traffic, and duration scaling. In simulations from 64K to 1M tokens, FreqFormer substantially reduces estimated attention FLOPs and KV-related memory traffic versus dense attention while keeping a hardware-friendly access pattern. This supports spectrally structured heterogeneous attention as a practical direction for long-video diffusion transformers, potentially enabling much longer and more detailed video generation under limited compute budgets.
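The scaling argument can be sanity-checked with a back-of-envelope FLOP model: dense attention scales as N², while the three branches scale as (N/c)², N·b, and N·w for compression ratio c, block size b, and window width w. The parameter values below are illustrative assumptions, not the paper's configuration, and the model ignores routing overhead, cross-band exchange, and memory-bound effects, so it overstates real-world savings.

```python
# Back-of-envelope attention FLOP model in the spirit of the paper's
# complexity analysis. Compression ratio, block size, and window size are
# illustrative assumptions; overheads and memory effects are ignored.
def dense_flops(n, d):
    return 4 * n * n * d                 # QK^T plus AV, ~2*N^2*D each

def freqformer_flops(n, d, compress=8, block=256, window=64):
    low = dense_flops(n // compress, d)  # dense attention on pooled tokens
    mid = 4 * n * block * d              # N/b blocks, each ~4*b^2*D
    high = 4 * n * window * d            # local windows of width w
    return low + mid + high

d = 64
for n in (64 * 1024, 256 * 1024, 1024 * 1024):
    ratio = dense_flops(n, d) / freqformer_flops(n, d)
    print(f"N={n:>8}: estimated attention-FLOP reduction ~{ratio:,.1f}x")
```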
- FreqFormer splits video features into low-, mid-, and high-frequency bands and applies a different attention operator to each
- A lightweight spectral routing network dynamically allocates compute across bands based on the denoising stage (see the routing sketch after this list)
- Simulations from 64K to 1M tokens show substantial FLOP and memory traffic reductions vs dense attention
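As referenced above, here is a toy version of the routing idea: a small MLP conditioned on the diffusion timestep that softly allocates attention heads across the three bands. The class name, sizes, and timestep-only conditioning are assumptions; per the summary, the actual router also consumes layer statistics.

```python
# Toy spectral router: a timestep-conditioned MLP producing a soft
# allocation of attention heads across the three bands. The class name,
# sizes, and timestep-only conditioning are assumptions; per the paper,
# the real router also consumes layer statistics.
import torch
import torch.nn as nn

class SpectralRouter(nn.Module):
    def __init__(self, n_heads=16, hidden=32):
        super().__init__()
        self.n_heads = n_heads
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 3))

    def forward(self, t):
        """t in [0, 1], 1 = noisiest step. Returns per-band head
        fractions (low, mid, high) that sum to 1."""
        return torch.softmax(self.net(t.view(-1, 1)), dim=-1)

router = SpectralRouter()
for t in (0.9, 0.5, 0.1):
    frac = router(torch.tensor([t]))
    heads = (frac * router.n_heads).round().int().tolist()[0]
    print(f"t={t}: heads per band (low/mid/high) = {heads}")
```

At random initialization the allocation is near-uniform; the early-structure/late-detail shift described in the summary would come from training the router jointly with the model.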
Why It Matters
Makes long-video generation far more practical by cutting attention compute by up to 10x in simulation, enabling richer, longer AI-generated videos.