Open Source

CPU TTS benchmark: Kokoro 82M vs Supertonic 3 – speed vs quality showdown

Supertonic runs 6x realtime at 2 steps, but Kokoro still sounds more human.

Deep Dive

A Reddit user (gvij) ran a head-to-head CPU benchmark of two popular TTS models, Kokoro 82M and Supertonic 3. On an AMD EPYC 7763 with 4 vCPUs, 16GB RAM, and no GPU (described as roughly comparable to a Ryzen 5600 or N100), they ran 120 timed tests across 6 text lengths, from 12 to 1712 characters. Speed is reported as real-time factor (RTF): synthesis time divided by audio duration, so lower is faster. Supertonic 3, a flow-matching model with adjustable inference steps, was the fastest: at 2 steps ("speed mode") it achieved a mean RTF of 0.165 (6.1x realtime) and just 1.82s wall-clock latency for a 13-second audio clip. At its default 5 steps, RTF was 0.313 (3.2x realtime) with 3.67s latency. Kokoro 82M, the TTS Arena leader, was slower: the PyTorch backend gave RTF 0.469 (2.1x realtime) and 5.62s latency, and ONNX was unexpectedly slower still at 0.509 (2.0x) on this AMD CPU.
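For readers unfamiliar with the metric, the relationship between RTF and the "x realtime" figures can be checked with a few lines of Python. This is a minimal sketch, not the author's benchmark script; the function names `real_time_factor` and `speedup` are made up for illustration, and the only data used are the mean RTFs quoted above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """Convert an RTF into an 'x realtime' multiplier (its reciprocal)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Mean RTFs as quoted in the summary; nothing here is newly measured.
reported = {
    "Supertonic 3, 2 steps": 0.165,
    "Supertonic 3, 5 steps": 0.313,
    "Kokoro 82M, PyTorch": 0.469,
    "Kokoro 82M, ONNX": 0.509,
}

for name, rtf in reported.items():
    print(f"{name}: RTF {rtf:.3f} -> {speedup(rtf):.1f}x realtime")
```

An RTF below 1.0 means the model generates audio faster than it takes to play back, which is the minimum bar for streaming speech without gaps.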

Quality is where the ranking flips. At 2 steps, Supertonic produced slurred words and mechanical prosody: fine for prototyping, not for production. At 5 steps it became genuinely usable. But Kokoro 82M, regardless of backend, delivered what the author judged the most natural speech of any model in its size class, which fits its top TTS Arena position. The practical takeaway: choose Kokoro for human-like voice output and accept the slower speed, Supertonic at 5 steps for low-latency assistants and chatbots, and Supertonic at 2 steps only for demos. Two surprises emerged. First, Kokoro ONNX was slower than PyTorch on this AMD hardware, possibly because any ONNX benefit shows up only on longer texts. Second, Supertonic has significantly higher per-call overhead, which narrows its lead on very short utterances. The full dataset, including 24 audio samples and the benchmark scripts, is on GitHub, and the author invites tests on an N100 or Raspberry Pi 5 for edge-deployment insights.
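The per-call overhead effect can be illustrated with a simple linear latency model, where each call costs a fixed overhead plus time proportional to the audio produced. This is a sketch under stated assumptions: the model is a simplification, and every parameter below is hypothetical, chosen only to show the shape of the effect. None of these numbers come from the benchmark.

```python
def latency(overhead_s: float, rtf_compute: float, audio_s: float) -> float:
    """Linear model: fixed per-call cost plus time proportional to audio length."""
    return overhead_s + rtf_compute * audio_s


# HYPOTHETICAL parameters for illustration only; not measured values.
fast_model = {"overhead_s": 0.20, "rtf_compute": 0.15}  # fast per audio second, costly per call
slow_model = {"overhead_s": 0.05, "rtf_compute": 0.45}  # slower per audio second, cheap per call


def advantage(audio_s: float) -> float:
    """How many times faster the 'fast' model finishes for a clip of this length."""
    return latency(**slow_model, audio_s=audio_s) / latency(**fast_model, audio_s=audio_s)


for audio_s in (1.0, 3.0, 13.0):
    print(f"{audio_s:>5.1f}s clip: fast model is {advantage(audio_s):.2f}x quicker")
```

Under these made-up parameters the faster model's lead shrinks as clips get shorter, because the fixed overhead makes up a larger share of total latency. That is why short-utterance workloads are worth benchmarking separately from long-form synthesis.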

Key Points
  • Supertonic 3 at 2 steps achieved 6.1x realtime (RTF 0.165) but with rough audio; at 5 steps, 3.2x realtime (RTF 0.313) with clean output.
  • Kokoro 82M delivered the most natural speech but was slower: 2.1x realtime on PyTorch (RTF 0.469), 5.62s latency for 196 chars.
  • Surprise: Kokoro ONNX was slower than PyTorch on this AMD CPU; Supertonic has higher per-call overhead, reducing its speed advantage on short utterances.

Why It Matters

The benchmark gives developers real-world CPU TTS numbers for edge deployment, helping them weigh speed against quality for chatbots and assistants.