WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
New research from Hanna Lee et al. achieves near-constant latency for text-to-speech models.
A research team led by Hanna Lee has introduced WAND (Windowed Attention and Knowledge Distillation), a framework designed to address the scalability problem in modern text-to-speech (TTS) models. Current autoregressive TTS models, which generate high-quality speech one token at a time, rely on full self-attention over the entire sequence, so compute scales quadratically with sequence length and key-value cache memory keeps growing with it. WAND tackles this by splitting the attention mechanism: persistent global attention over the initial text-conditioning tokens, and local sliding-window attention over the generated audio tokens. This architectural shift makes the per-step compute and memory cost effectively constant in sequence length rather than growing with it.
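To make the split concrete, here is a minimal sketch of the kind of attention mask such a hybrid scheme implies: every position attends to the full text prefix, while audio positions additionally see only a fixed causal window of recent audio tokens. The function name, the window size, and the masking details are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def hybrid_attention_mask(num_text: int, num_audio: int, window: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for a hypothetical WAND-style decoder.

    All positions attend to the text-conditioning prefix (persistent global
    attention); audio positions additionally attend to at most `window`
    preceding audio positions (causal sliding-window attention).
    """
    total = num_text + num_audio
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Global attention: every query sees the whole text prefix.
    mask[:, :num_text] = True

    # Causal sliding window among the audio tokens.
    for q in range(num_text, total):
        lo = max(num_text, q - window + 1)
        mask[q, lo:q + 1] = True

    # Keep the text prefix itself causal.
    mask[:num_text, :num_text] = torch.ones(num_text, num_text).tril().bool()
    return mask

# Example: 32 text tokens, 500 audio tokens, window of 64 audio tokens.
m = hybrid_attention_mask(num_text=32, num_audio=500, window=64)
print(m.shape)  # torch.Size([532, 532])
```

The key property is that each new audio token attends to a bounded set of keys (the text prefix plus the window), so the KV cache and the per-step attention cost stop growing with the length of the audio.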
To adopt WAND without sacrificing the output quality of pre-trained models, the team employs a two-pronged training strategy. First, they use curriculum learning, gradually tightening the attention window during fine-tuning to keep the process stable. Second, they apply knowledge distillation, in which a smaller WAND-equipped student model learns to mimic the outputs of a larger, full-attention teacher model. Together, these techniques recover the teacher's quality with comparatively little fine-tuning data. Evaluated on three modern autoregressive TTS (AR-TTS) architectures, WAND maintained the original synthesis fidelity while delivering up to a 66.2% reduction in key-value (KV) cache memory and achieving near-constant, predictable latency per generation step, regardless of audio length.
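As a rough illustration of those two ingredients, the sketch below pairs a linear window-shrinking schedule with a standard soft-label distillation loss (temperature-softened KL divergence between teacher and student token distributions). The schedule shape, window sizes, and temperature are assumptions chosen for illustration, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def window_schedule(step: int, total_steps: int, start: int = 1024, end: int = 64) -> int:
    """Hypothetical curriculum: shrink the audio attention window linearly
    from `start` tokens down to `end` tokens over the fine-tuning run."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(round(start + frac * (end - start)))

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic soft-label KD loss: KL divergence between the full-attention
    teacher's softened token distribution and the windowed student's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```

In a training loop, the window from `window_schedule` would set the student's attention mask at each step, while `distillation_loss` (typically mixed with the ordinary next-token loss) pulls the student's predictions toward the frozen teacher's.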
The implications are significant for deploying TTS in resource-constrained environments like mobile devices or for generating long-form audio such as audiobooks or podcasts. By decoupling latency from output length, WAND enables real-time, high-quality speech synthesis that was previously bottlenecked by memory constraints. The paper, submitted to Interspeech 2026, represents a major step toward efficient generative audio models without the traditional trade-offs in quality.
- Hybrid attention mechanism combines persistent global attention over the text conditioning with local sliding-window attention over the audio tokens, keeping per-step compute effectively constant.
- Achieves up to a 66.2% KV cache memory reduction and length-invariant latency while preserving the original model's quality (a rough cache-size sketch follows this list).
- Uses curriculum learning and knowledge distillation for stable, data-efficient fine-tuning of existing pretrained TTS models.
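For intuition about the memory claim, here is a back-of-envelope KV-cache estimate for a hypothetical decoder; the layer count, head sizes, and window length are invented for illustration and are not meant to reproduce the paper's 66.2% figure, which depends on the actual models and sequence lengths evaluated.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int, tokens: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: two tensors (K and V) per layer, each of
    shape [heads, tokens, head_dim], stored at `bytes_per_value` (fp16 = 2)."""
    return 2 * layers * heads * tokens * head_dim * bytes_per_value

# Hypothetical decoder: 24 layers, 16 heads of dimension 64, fp16 cache.
full = kv_cache_bytes(24, 16, 64, tokens=3000)          # full attention: cache grows with audio length
windowed = kv_cache_bytes(24, 16, 64, tokens=32 + 256)  # text prefix + fixed audio window
print(f"full: {full / 2**20:.1f} MiB, windowed: {windowed / 2**20:.1f} MiB")
```

With full attention the cache keeps growing as the audio gets longer; with a fixed window it stays at the prefix-plus-window size, which is also why per-step latency stops depending on how much audio has already been generated.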
Why It Matters
Enables real-time, high-fidelity speech synthesis on edge devices and for long-form content by breaking the memory-latency bottleneck.