[P] SoproTTS v1.5: A 135M zero-shot voice cloning TTS model trained for ~$100 on 1 GPU, running ~20× real-time on the CPU
This $100 side project could make professional voice cloning accessible to everyone.
Deep Dive
SoproTTS v1.5 is a new 135M-parameter text-to-speech model capable of zero-shot voice cloning. It was trained for only about $100 on a single GPU. The model runs at a real-time factor (RTF) of about 0.05 on a base MacBook M3 CPU, meaning it generates audio roughly 20 times faster than real time, with about 250 ms of latency when streaming. The author says the training code will be released soon, which would open the door to wider experimentation.
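For readers unfamiliar with the metric, the relationship between real-time factor and the "20 times faster" figure can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the example timings (0.5 seconds of compute for 10 seconds of audio) are hypothetical and are not taken from the SoproTTS code or benchmarks.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF: compute time spent per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """Return how many times faster than playback speed the synthesis runs."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timings chosen to match the reported RTF of about 0.05:
# 10 seconds of speech generated in 0.5 seconds of CPU time.
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=10.0)
speedup = speedup_over_real_time(rtf)
print(f"RTF = {rtf:.2f}, about {speedup:.0f}x faster than real time")
```

An RTF below 1.0 means the model produces audio faster than it can be played back, which is the precondition for streaming without gaps.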
Why It Matters
It dramatically lowers the cost and hardware barriers to creating high-quality, fast synthetic voices.