Audio & Speech

Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

New energy-scoring method generates audio from text in a single step, approaching multi-step diffusion quality.

Deep Dive

Autoregressive (AR) models with diffusion heads have become the standard for text-to-audio synthesis, but their iterative decoding and multi-step sampling introduce high latency. To solve this, a team of researchers from National Taiwan University, Amazon, and other institutions developed a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head directly maps Gaussian noise to audio latents in a single step, eliminating the need for recursive diffusion sampling. Distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training, ensuring the output quality remains high despite the speedup.
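The paper's exact loss is not reproduced here, but the general shape of an energy-distance objective can be sketched as follows: two independent generator samples are drawn for the same text conditioning, and the loss combines two attraction terms (sample to ground-truth latent) with a repulsion term (sample to sample). This is a minimal illustrative form using plain Python lists; the real training signal operates on audio latents and may use different weightings or exponents.

```python
import math

def energy_score_loss(gen_a, gen_b, target):
    """Energy-distance style loss for one training example (illustrative).

    gen_a, gen_b: two independent generator outputs produced from the same
    text conditioning but different Gaussian noise draws.
    target: the ground-truth audio latent.
    The first two terms pull samples toward the target; subtracting the
    pairwise distance discourages the generator from collapsing both
    samples onto a single point.
    """
    return (math.dist(gen_a, target)
            + math.dist(gen_b, target)
            - math.dist(gen_a, gen_b))

# Example: samples on either side of the target.
loss = energy_score_loss([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
# 1 + 1 - sqrt(2) ≈ 0.586; drops to 0 as both samples reach the target.
```

Because the loss compares samples only through distances, it can be minimized in a single forward pass per sample, which is what lets the energy-scoring head skip iterative denoising.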

On the AudioCaps benchmark, the new method consistently surpasses previous one-step approaches such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics. More importantly, it substantially narrows the quality gap to AR diffusion systems that use multi-step sampling. Compared to the current state-of-the-art AR diffusion system IMPACT, this approach achieves up to 8.5x faster batch inference while delivering highly competitive audio quality. The results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis, potentially enabling real-time applications like sound design, gaming, and assistive technologies.

Key Points
  • Energy-scoring head maps Gaussian noise to audio latents in one step, removing iterative diffusion overhead.
  • Combines representation distillation from a masked autoregressive model to preserve conditioning quality.
  • Achieves up to 8.5x faster batch inference than the state-of-the-art IMPACT system with competitive audio quality.
  • Outperforms prior one-step baselines (ConsistencyTTA, SoundCTM, AudioLCM, AudioTurbo) on AudioCaps.

Why It Matters

Near-instant audio generation from text enables real-time sound design, accessibility tools, and low-latency creative workflows.