Stylus uses image diffusion models for training-free music style transfer
A training-free framework beats state-of-the-art baselines with 34.1% higher content preservation
Music style transfer, which blends the structure of one track with the style of another, has long required either coarse text descriptions or expensive task-specific training. Researchers from NYU, Brookhaven National Lab, and Seoul National University now introduce Stylus, a training-free framework that repurposes pretrained image diffusion models for this task. Stylus converts audio into mel-spectrograms (visual representations of a sound's frequency content over time), treats them as structured images, and manipulates the model's self-attention layers: it injects keys and values from the style reference track while keeping the queries of the source track, which carry its structure. To preserve fidelity, the team adds a phase-preserving reconstruction strategy that mitigates spectrogram inversion artifacts, along with a control inspired by classifier-free guidance for adjusting stylization strength.
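The attention swap described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head setup, and the `strength` parameter are hypothetical, and the linear blend between the plain and stylized outputs is only an assumption modeled on how classifier-free guidance combines two predictions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention for a single head.
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax((q @ k.T) * scale) @ v

def style_injected_attention(q_content, k_content, v_content,
                             k_style, v_style, strength=1.0):
    """Keep the content queries, swap in style keys and values.

    `strength` is a hypothetical knob (an assumption, not from the paper):
    0.0 returns ordinary self-attention on the content track,
    1.0 returns the fully style-injected output,
    and values above 1.0 extrapolate past it.
    """
    plain = attention(q_content, k_content, v_content)
    stylized = attention(q_content, k_style, v_style)
    return plain + strength * (stylized - plain)

# Toy example: 4 content tokens and 6 style tokens, each of dimension 8.
rng = np.random.default_rng(0)
q_c, k_c, v_c = (rng.normal(size=(4, 8)) for _ in range(3))
k_s, v_s = (rng.normal(size=(6, 8)) for _ in range(2))
mixed = style_injected_attention(q_c, k_c, v_c, k_s, v_s, strength=0.7)
```

Because the queries still come from the content track, each output position corresponds to a position in the source spectrogram, while the values it aggregates come from the style reference.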
Extensive evaluations involving 2,925 human ratings show that Stylus significantly outperforms existing zero-shot methods, with 34.1% higher content preservation and 25.7% better perceptual quality. The work shows that generic image priors can be leveraged for structured audio transformation without any fine-tuning. Accepted at ICIP 2026, Stylus opens the door to personalized music creation without large training datasets or specialist hardware, making style transfer accessible to a wider audience.
- Stylus repurposes pretrained image diffusion models for music style transfer without any training or fine-tuning.
- Achieves 34.1% higher content preservation and 25.7% better perceptual quality than state-of-the-art baselines.
- Uses phase-preserving reconstruction and classifier-free-guidance-inspired control to avoid audio artifacts and adjust stylization.
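The summary does not spell out how the phase-preserving reconstruction works. One common way to avoid estimating phase from scratch is to reuse the phase of the source audio and pair it with the edited magnitudes. The sketch below shows only that general idea, under simplifying assumptions: it uses a single FFT over the whole signal instead of a short-time transform, it skips the mel scale entirely, and the function name is hypothetical.

```python
import numpy as np

def reconstruct_with_source_phase(source_wave, new_magnitude):
    """Combine edited magnitudes with the source signal's own phase.

    Assumption: `new_magnitude` has one value per rFFT bin of `source_wave`.
    This is a simplified stand-in, not the reconstruction used by Stylus.
    """
    spectrum = np.fft.rfft(source_wave)
    phase = np.angle(spectrum)                      # phase taken from the source
    edited = new_magnitude * np.exp(1j * phase)     # new magnitude, old phase
    return np.fft.irfft(edited, n=len(source_wave))

# Toy check: leaving the magnitudes unchanged gives back the source signal.
t = np.arange(64) / 64.0
wave = np.sin(2 * np.pi * 5 * t)
magnitude = np.abs(np.fft.rfft(wave))
roundtrip = reconstruct_with_source_phase(wave, magnitude)
```

Reusing a known phase sidesteps the artifacts that iterative phase estimation can introduce when inverting a magnitude-only spectrogram.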
Why It Matters
Stylus enables high-quality, personalized music style transfer without expensive training or reliance on coarse text descriptions.