HSA: New method cuts video generation runtime by up to 75% while maintaining quality
Why give every token 40 steps when most don't need them?
Diffusion Transformers (DiTs) are state-of-the-art for video generation but suffer from immense computational costs because they apply the same 40-step denoising process to every token. A new paper from Ernie Chu and Vishal M. Patel challenges that assumption with Heterogeneous Step Allocation (HSA). The key insight: human vision ignores redundant motion, so models should too. HSA assigns different step budgets to each spatiotemporal token based on its velocity dynamics. Tokens in low-motion areas (backgrounds) receive far fewer steps, while moving objects get the full schedule.
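To make the idea of per-token step budgets concrete, here is a minimal sketch in plain Python. It is not the authors' allocation rule: the function name `allocate_steps`, the linear mapping, the `min_steps` floor, and the example motion scores are all assumptions made for illustration. It only shows the general shape of the idea, where a higher motion score earns a larger share of the 40-step schedule.

```python
def allocate_steps(motion_scores, full_steps=40, min_steps=10):
    """Map per-token motion scores to per-token step budgets.

    Hypothetical rule: linearly interpolate between min_steps and
    full_steps after min-max normalizing the scores.
    """
    lo, hi = min(motion_scores), max(motion_scores)
    span = hi - lo
    budgets = []
    for score in motion_scores:
        # Normalize to [0, 1]; if every score is equal, keep the full schedule.
        weight = (score - lo) / span if span > 0 else 1.0
        budgets.append(min_steps + round(weight * (full_steps - min_steps)))
    return budgets


# Made-up scores: two near-static background tokens, one fast-moving
# token, and one token with moderate motion.
scores = [0.02, 0.05, 0.90, 0.40]
budgets = allocate_steps(scores)
total_used = sum(budgets)
total_uniform = 40 * len(scores)
print(budgets, total_used, total_uniform)
```

In this toy setup the static tokens sit near the floor while the fast-moving token keeps the full schedule, so the total number of token-steps falls well below the uniform baseline.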
To handle the resulting sequence-length mismatch, HSA introduces a KV-cache synchronization mechanism: active tokens still attend to the full sequence through cached keys and values, while inactive tokens are skipped in the forward pass. A cached Euler update advances skipped tokens' latent states in one shot without extra model evaluations. Tested on Wan-2 and LTX-2 for both text-to-video and image-to-video tasks, HSA achieves a superior quality-runtime Pareto frontier, especially at 50% and 25% of original runtime. It requires no expensive offline profiling, making it a practical drop-in acceleration for existing DiT pipelines.
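The cached Euler update can be illustrated with a small sketch. It assumes a standard explicit Euler sampler of the form x_next = x + (t_next - t) * v and assumes a skipped token simply reuses its most recently computed velocity. The helper names `euler_step` and `cached_jump`, the timestep grid, and the numbers are hypothetical, and the paper's exact update may differ.

```python
def euler_step(x, v, t_from, t_to):
    """One explicit Euler step for a single token's latent vector."""
    return [xi + (t_to - t_from) * vi for xi, vi in zip(x, v)]


def cached_jump(x, v_cached, timesteps, start, end):
    """Advance a skipped token from timesteps[start] to timesteps[end]
    in one update, reusing its cached velocity (no model call)."""
    return euler_step(x, v_cached, timesteps[start], timesteps[end])


# Hypothetical 4-step schedule running from t=1 (noise) down to t=0.
timesteps = [1.0, 0.75, 0.5, 0.25, 0.0]
x0 = [0.5, -1.0]        # latent of one inactive token
v_cached = [0.2, 0.4]   # velocity cached from its last model evaluation

# One-shot jump across all four intervals.
jumped = cached_jump(x0, v_cached, timesteps, 0, 4)

# Reference: stepping interval by interval with the same frozen velocity.
stepped = x0
for i in range(4):
    stepped = euler_step(stepped, v_cached, timesteps[i], timesteps[i + 1])

print(jumped, stepped)
```

Because the velocity is held fixed, the single jump lands on the same latent as the interval-by-interval loop (up to floating-point rounding), which is why skipped tokens can be advanced without further model evaluations.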
- Training-free method cuts DiT inference runtime to 50% or 25% of the original while maintaining quality
- Assigns step budgets based on each token's velocity dynamics, so low-motion tokens skip steps
- KV-cache sync and cached Euler update let inactive tokens be bypassed without losing context
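The KV-cache idea in the last bullet can be sketched as single-head attention in which only active tokens produce queries, while keys and values for every token come from a cache. This is an illustrative toy under stated assumptions, not the paper's implementation: `refresh_cache`, `attend_active`, the tensor shapes, and all values are hypothetical.

```python
import math


def refresh_cache(k_cache, v_cache, active_idx, k_new, v_new):
    """Overwrite cache rows for active tokens; inactive rows keep
    the keys and values from their last evaluation."""
    for i, k, v in zip(active_idx, k_new, v_new):
        k_cache[i] = k
        v_cache[i] = v


def attend_active(q_active, k_cache, v_cache):
    """Softmax attention with queries for active tokens only and
    keys/values for the full sequence."""
    dim = len(v_cache[0])
    outputs = []
    for q in q_active:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in k_cache]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, v_cache)) / total
            for d in range(dim)
        ])
    return outputs


# Four tokens; tokens 0 and 1 are inactive this step, 2 and 3 are active.
k_cache = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.3], [0.1, 0.9]]
v_cache = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
active_idx = [2, 3]
k_new = [[0.4, 0.2], [0.0, 1.0]]
v_new = [[0.6, 0.4], [0.1, 0.9]]
q_active = [[1.0, 0.5], [0.2, 0.7]]

refresh_cache(k_cache, v_cache, active_idx, k_new, v_new)
out = attend_active(q_active, k_cache, v_cache)
print(out)
```

Only two output rows are computed, yet each one is a weighted mix over all four tokens, which is the sense in which inactive tokens are bypassed without losing context.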
Why It Matters
Cutting runtime to a half or a quarter of the original brings fast video generation closer to practical for edge devices and production pipelines while preserving output quality.