Audio & Speech

TokenChain: A Discrete Speech Chain via Semantic Token Modeling

New fully discrete speech chain boosts ASR and TTS with semantic token modeling.

Deep Dive

TokenChain, presented by researchers Mingxuan Wang and Satoshi Nakamura at ICASSP 2026, reimagines the machine speech chain, a concept originally designed to simulate the human perception-production loop, using a fully discrete, token-based architecture. Unlike prior approaches built on continuous representations, TokenChain splits the pipeline into three components: a semantic-token-based automatic speech recognition (ASR) model, an autoregressive text-to-semantic (T2S) model trained jointly with ASR, and a masked-generative semantic-to-acoustic model used only for synthesis. This design allows end-to-end feedback across the discrete text interface: straight-through argmax and Gumbel-Softmax approximations let gradients pass through the otherwise non-differentiable token selection, and dynamic weight averaging balances that feedback against the supervised ASR loss. Ablation studies examine temperature schedules for both in-domain and cross-domain transfer.
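To make the two training tricks named above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names `gumbel_softmax` and `dwa_weights` are hypothetical, the logits and losses are made-up values, and the weighting rule assumes the commonly used dynamic weight averaging formulation (a softmax over recent loss ratios, scaled by the number of tasks); the summary does not say whether TokenChain uses exactly this form.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return (soft, hard): a relaxed sample and its one-hot argmax.

    In an autodiff framework, the straight-through trick would use `hard`
    in the forward pass and the gradient of `soft` in the backward pass.
    This stdlib sketch only computes the two forward values.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    best = soft.index(max(soft))
    hard = [1.0 if i == best else 0.0 for i in range(len(soft))]
    return soft, hard


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Assumed DWA rule: tasks whose loss is falling more slowly get more weight.

    Weights are K * softmax(r / temperature), where r is the ratio of each
    task's loss at the previous step to its loss two steps back, and K is
    the number of tasks, so the weights always sum to K.
    """
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [len(ratios) * e / total for e in exps]


# Hypothetical values for illustration only.
rng = random.Random(0)
soft, hard = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
print("relaxed sample:", [round(p, 3) for p in soft])
print("one-hot token: ", hard)

# Task 0 improved less (1.0 -> 0.9) than task 1 (1.0 -> 0.5).
print("loss weights:  ", dwa_weights([0.9, 0.5], [1.0, 1.0]))
```

Lowering the Gumbel-Softmax temperature pushes the relaxed sample toward the one-hot vector, which is one reason the schedule of that temperature is a natural thing to ablate.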

Evaluation on standard benchmarks shows significant gains. On LibriSpeech, TokenChain surpasses baseline accuracy 2–6 epochs earlier and reaches 5–13% lower error at equal epochs, with stable T2S performance. On TED-LIUM, the system achieves a 56% relative reduction in ASR word error rate (WER) and reduces T2S WER by 31%, all with minimal forgetting. These results suggest that chain learning remains effective even with discrete token interfaces and models. The paper has been accepted for publication at ICASSP 2026.
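Because the TED-LIUM figure is a relative reduction, it describes the change as a fraction of the baseline error rather than a drop in absolute WER points. The sketch below shows the arithmetic. The helper name `relative_reduction` and the two WER values are hypothetical, chosen only to illustrate the calculation; they are not results from the paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values in percent, NOT taken from the paper.
baseline_wer = 20.0
new_wer = 8.8

absolute_drop = baseline_wer - new_wer
relative_drop = relative_reduction(baseline_wer, new_wer)

print(f"absolute drop: {absolute_drop:.1f} WER points")
print(f"relative drop: {relative_drop:.0%}")
```

The same relative reduction corresponds to very different absolute gains depending on how high the baseline error was, which is why the baseline matters when comparing reported improvements.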

Key Points
  • Couples semantic-token ASR with a two-stage TTS (text-to-semantic + semantic-to-acoustic) for joint improvement.
  • Achieves a 56% relative reduction in ASR word error rate and reduces T2S word error rate by 31% on TED-LIUM, with minimal forgetting.
  • Surpasses baseline accuracy 2–6 epochs earlier on LibriSpeech, with 5–13% lower error at equal epochs.

Why It Matters

TokenChain shows that fully discrete, token-based speech chains can improve both ASR and TTS while converging in fewer epochs, which could support more efficient voice AI.