Audio & Speech

TiCo lets AI voices obey time commands – 2.7× lower duration error than its backbone

Voice assistants can now follow 'speak for 15 seconds' with Spoken Time Markers.

Deep Dive

TiCo (Time‑Controllable Spoken Dialogue Model), developed by researchers from MIT and National Taiwan University, addresses a fundamental gap in voice AI: the inability to control how long a spoken response lasts. Traditional models generate natural dialogue but ignore timing constraints like 'please speak for about 15 seconds.' TiCo solves this by injecting Spoken Time Markers (STMs)—tokens like '<10.6 seconds>'—into its internal generation process. These markers give the model real‑time awareness of elapsed speaking time, allowing it to dynamically adjust the remaining content to meet the requested duration. This time awareness is achieved without modifying the underlying backbone architecture, making TiCo a lightweight add‑on that can be applied to existing spoken dialogue models.
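To make the marker idea concrete, here is a minimal sketch of how elapsed-time markers could be interleaved into a transcript. It is an illustration only, not TiCo's actual procedure: the function name insert_time_markers, the word-level end timestamps, and the fixed 2-second marker interval are all assumptions, since the summary above gives only the marker format ('<10.6 seconds>').

```python
from typing import List, Tuple


def insert_time_markers(words: List[Tuple[str, float]], interval: float = 2.0) -> str:
    """Interleave elapsed-time markers into a word sequence.

    `words` holds (word, end_time_in_seconds) pairs. A marker is emitted after
    the first word whose end time reaches the next multiple of `interval`.
    Both the input format and the interval are hypothetical choices.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")

    pieces: List[str] = []
    next_mark = interval
    for word, end_time in words:
        pieces.append(word)
        if end_time >= next_mark:
            # Marker text mirrors the '<10.6 seconds>' style quoted in the article.
            pieces.append(f"<{end_time:.1f} seconds>")
            # Skip any marker slots that this word already passed.
            while next_mark <= end_time:
                next_mark += interval
    return " ".join(pieces)


timed_words = [("Hello", 0.4), ("there", 0.9), ("how", 2.1), ("are", 2.5), ("you", 4.3)]
marked = insert_time_markers(timed_words)
```

With these made-up timestamps, marked becomes "Hello there how <2.1 seconds> are you <4.3 seconds>": the model would see how much speaking time has elapsed at each marker and could shorten or extend what follows.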

TiCo is post‑trained using a self‑generation pipeline combined with reinforcement learning and a verifiable reward function that penalises duration error. Notably, this training does not require any question‑answer paired data; the model learns purely from self‑generated examples and reward signals. To evaluate time‑controlled instruction following, the team introduces TiCo‑Bench, the first benchmark of its kind. On this benchmark, existing open‑source and commercial models frequently fail to meet explicit time constraints, while TiCo achieves a 2.7× reduction in duration error compared to its backbone model and a 1.6× reduction compared to the best competing baseline. Crucially, this timing precision comes without sacrificing the quality of the generated speech. For voice assistants, interactive agents, and any system where natural conversation pacing matters, TiCo opens the door to more predictable, human‑like interactions.
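A verifiable duration reward can be computed directly from the requested and the measured length of a response, with no human labels. The sketch below shows one plausible shape for such a reward. The exact formula TiCo uses is not given in this summary, so the relative-error form, the clipping at zero, and the name duration_reward are assumptions.

```python
def duration_reward(target_s: float, actual_s: float) -> float:
    """Return a reward in [0, 1] that shrinks as duration error grows.

    Hypothetical formula: 1 minus the relative duration error, clipped at 0.
    A response that hits the target exactly scores 1.0; one that is off by
    the full target length or more scores 0.0.
    """
    if target_s <= 0:
        raise ValueError("target_s must be positive")
    relative_error = abs(actual_s - target_s) / target_s
    return max(0.0, 1.0 - relative_error)
```

Under this assumed formula, a 12-second answer to a 'speak for 15 seconds' request has a relative error of 0.2 and earns a reward of about 0.8, while a 45-second answer earns 0.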

Key Points
  • TiCo uses Spoken Time Markers (e.g., <10.6 seconds>) inserted during generation to track elapsed time and adjust content accordingly.
  • Post‑trained without any question‑answer paired data, leveraging self‑generation and reinforcement learning with a verifiable reward for duration accuracy.
  • On the new TiCo‑Bench benchmark, TiCo reduces duration error by 2.7× compared to its backbone and by 1.6× compared to the strongest alternative baseline.
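For readers wondering how an "N× reduction in duration error" is read, the sketch below computes a mean absolute duration error and the ratio between two systems. The metric definition is an assumption (the summary does not say how TiCo‑Bench measures error), and the numbers are invented for illustration, not taken from the benchmark.

```python
from statistics import mean
from typing import Sequence


def mean_abs_duration_error(targets: Sequence[float], actuals: Sequence[float]) -> float:
    """Average absolute gap, in seconds, between requested and produced durations."""
    if not targets or len(targets) != len(actuals):
        raise ValueError("targets and actuals must be non-empty and equally long")
    return mean(abs(actual - target) for target, actual in zip(targets, actuals))


def reduction_factor(baseline_error: float, improved_error: float) -> float:
    """How many times smaller the improved error is than the baseline error."""
    if improved_error <= 0:
        raise ValueError("improved_error must be positive")
    return baseline_error / improved_error


# Hypothetical requests and outputs, in seconds (not benchmark data).
requested = [10.0, 20.0, 30.0]
backbone_out = [16.0, 26.0, 36.0]   # each response is off by 6 s
improved_out = [12.0, 22.0, 32.0]   # each response is off by 2 s

factor = reduction_factor(
    mean_abs_duration_error(requested, backbone_out),
    mean_abs_duration_error(requested, improved_out),
)
```

Here the made-up backbone misses by 6 seconds on average and the improved model by 2 seconds, so factor is 3.0, which would be reported as a 3× reduction.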

Why It Matters

Voice assistants and interactive agents can now control response length, enabling more natural, predictable conversations.