Audio & Speech

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

New architecture lets voice AI access external data mid-conversation without breaking natural flow.

Deep Dive

A team from Kyutai and collaborating institutions has published a research paper detailing MoshiRAG, an architecture designed to address the factuality problem in real-time conversational voice AI. Full-duplex speech models such as Kyutai's own Moshi handle natural conversational behavior like interruptions and backchannels, but they often struggle with factual accuracy. Scaling these models up to improve factuality would make real-time inference too computationally expensive. MoshiRAG addresses this by decoupling the conversational interface from the knowledge source.

The system employs an asynchronous RAG (retrieval-augmented generation) framework. It identifies when a user's query requires external knowledge and initiates a retrieval process during the natural temporal gap between the start of a response and the delivery of its core information. This allows the compact, real-time conversational model to access powerful, up-to-date external databases or larger language models without breaking the flow of dialogue. The researchers report that MoshiRAG achieves factuality on par with the best publicly available non-duplex speech models while preserving full-duplex interactivity.
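The timing idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: every name here (needs_retrieval, retrieve, speak, respond, the KNOWLEDGE table) is a hypothetical stand-in, the trigger rule is a toy, and sleeps simulate lookup latency and speaking time. The only point shown is that the lookup starts in the background while the opening words of the response are being spoken.

```python
import asyncio
from typing import List

# Hypothetical stand-ins: none of these names come from the MoshiRAG paper.
KNOWLEDGE = {"capital of france": "Paris"}


def needs_retrieval(query: str) -> bool:
    """Toy trigger: treat anything phrased as a question as knowledge-seeking."""
    return query.strip().endswith("?")


async def retrieve(query: str, delay: float = 0.05) -> str:
    """Simulate a slow external lookup (a database or a larger language model)."""
    await asyncio.sleep(delay)
    key = query.strip().rstrip("?").lower()
    for topic, fact in KNOWLEDGE.items():
        if topic in key:
            return fact
    return "unknown"


async def speak(text: str, transcript: List[str], duration: float = 0.1) -> None:
    """Simulate streamed speech output, which takes real time to deliver."""
    transcript.append(text)
    await asyncio.sleep(duration)


async def respond(query: str) -> List[str]:
    transcript: List[str] = []
    if not needs_retrieval(query):
        # No external knowledge needed: just acknowledge the user.
        await speak("Mm-hm.", transcript)
        return transcript
    # Start the lookup in the background without waiting for it.
    lookup = asyncio.create_task(retrieve(query))
    # The opening of the response covers the retrieval latency.
    await speak("Good question, let me think.", transcript)
    # Collect the result only when the core information is due.
    fact = await lookup
    await speak(f"The answer is {fact}.", transcript)
    return transcript


if __name__ == "__main__":
    print(asyncio.run(respond("What is the capital of France?")))
```

In this toy setup the lookup (0.05 s) finishes while the opening phrase (0.1 s) is still being spoken, so the user hears no added delay. A real system would also have to handle lookups that take longer than the opening phrase, which this sketch does not address.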

A key advantage of the modular design is its flexibility. Different retrieval methods can be plugged in without needing to retrain the core speech model. The paper also demonstrates that the system performs well on out-of-domain tasks like mathematical reasoning, suggesting broader applicability beyond general knowledge Q&A. This represents a significant step toward voice assistants that are both highly responsive and reliably accurate.
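The plug-and-play property can be pictured as a small interface between the speech front end and whatever supplies the knowledge. The Python sketch below is an assumption-laden illustration, not the paper's API: Retriever, DictRetriever, CalculatorRetriever, and SpeechFrontEnd are all hypothetical names, and the calculator merely stands in for a reasoning-capable backend.

```python
from typing import Dict, Protocol


class Retriever(Protocol):
    """Hypothetical interface: any backend that maps a query to text."""

    def retrieve(self, query: str) -> str: ...


class DictRetriever:
    """Looks up facts in an in-memory table (stand-in for a database)."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table

    def retrieve(self, query: str) -> str:
        return self.table.get(query.lower(), "no result")


class CalculatorRetriever:
    """Evaluates 'a op b' integer arithmetic (stand-in for a reasoning backend)."""

    def retrieve(self, query: str) -> str:
        a, op, b = query.split()
        x, y = int(a), int(b)
        results = {"+": x + y, "-": x - y, "*": x * y}
        return str(results[op])


class SpeechFrontEnd:
    """Stand-in for the fixed speech model: it only consumes retrieved text."""

    def __init__(self, retriever: Retriever) -> None:
        self.retriever = retriever

    def answer(self, query: str) -> str:
        return f"Here is what I found: {self.retriever.retrieve(query)}"


if __name__ == "__main__":
    front = SpeechFrontEnd(DictRetriever({"capital of france": "Paris"}))
    print(front.answer("Capital of France"))
    # Swap the backend; the front end itself is unchanged.
    front.retriever = CalculatorRetriever()
    print(front.answer("6 * 7"))
```

Because the front end depends only on the retrieve method, swapping the backend changes no front-end code, which mirrors the claim that retrieval methods can change without retraining the core speech model.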

Key Points
  • Uses asynchronous RAG to fetch external data during the natural gap between the start of a response and the delivery of its core information, maintaining real-time flow.
  • Reports factuality on par with the best publicly available non-duplex speech models while keeping the real-time conversational model compact.
  • Modular design allows for plug-and-play retrieval systems without retraining the core speech model.

Why It Matters

MoshiRAG points toward voice assistants that are both instantly responsive and factually reliable, a combination that is crucial for professional and educational use.