Audio & Speech

MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model

This tiny 0.1B model rivals larger systems on speech and voice-cloning tasks.

Deep Dive

MiniMind-O is an open small-scale omni model that redefines what's possible with just 0.1B parameters. Built on the MiniMind language model, it accepts text, speech, and image inputs and outputs both text and streaming speech. The architecture pairs a full MiniMind backbone as the Thinker with an independent four-layer Talker built from MiniMind blocks. Frozen SenseVoice-Small and SigLIP2 encoders extract speech and image features, feeding lightweight MLP projectors that inject the resulting embeddings at modality-placeholder positions in the input sequence.

The Talker reads a middle-layer Thinker hidden state together with an autoregressive eight-layer Mimi-code buffer. Speaker control is handled via a dedicated speaker token, right-aligned reference codec prompts, and precomputed CAM++ speaker embeddings, which keeps voice conditioning within the audio-code context rather than requiring a separate TTS module.
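
To make the Thinker-Talker wiring concrete, here is a minimal PyTorch sketch of the two pieces the report describes: an MLP projector that maps frozen-encoder features into the Thinker embedding space, and a four-block Talker that decodes an eight-codebook Mimi-code stream conditioned on a middle-layer Thinker state. Every name, dimension, and hyperparameter below is an illustrative assumption, not the actual MiniMind-O code.

```python
# A minimal sketch of the Thinker-Talker wiring described above.
# All sizes and class names are assumptions for illustration only.
import torch
import torch.nn as nn

D_THINKER = 512       # assumed Thinker hidden size
D_TALKER = 512        # assumed Talker hidden size
N_CODEBOOKS = 8       # eight-layer Mimi-code interface (per the report)
CODEBOOK_SIZE = 2048  # assumed Mimi codebook vocabulary size


class MLPProjector(nn.Module):
    """Lightweight projector mapping frozen-encoder features (SenseVoice-Small
    or SigLIP2) into the Thinker embedding space; the projected vectors would
    be written at modality-placeholder positions in the input sequence."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_out), nn.GELU(), nn.Linear(d_out, d_out))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)


class Talker(nn.Module):
    """Four-block autoregressive decoder over Mimi codes, conditioned on a
    middle-layer Thinker hidden state (the semantic bridge)."""

    def __init__(self):
        super().__init__()
        self.code_emb = nn.ModuleList(
            nn.Embedding(CODEBOOK_SIZE, D_TALKER) for _ in range(N_CODEBOOKS))
        block = nn.TransformerEncoderLayer(D_TALKER, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=4)
        self.bridge = nn.Linear(D_THINKER, D_TALKER)
        self.heads = nn.ModuleList(
            nn.Linear(D_TALKER, CODEBOOK_SIZE) for _ in range(N_CODEBOOKS))

    def forward(self, thinker_mid: torch.Tensor, codes: torch.Tensor):
        # codes: (batch, time, 8); sum the eight codebook embeddings per step.
        x = sum(emb(codes[..., i]) for i, emb in enumerate(self.code_emb))
        x = x + self.bridge(thinker_mid)  # inject middle-layer Thinker state
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(x, mask=mask)     # causal self-attention over codes
        # One classification head per codebook: (batch, time, 8, vocab).
        return torch.stack([head(h) for head in self.heads], dim=-2)


talker = Talker()
mid_state = torch.randn(1, 10, D_THINKER)            # fake Thinker states
codes = torch.randint(0, CODEBOOK_SIZE, (1, 10, 8))  # fake Mimi code buffer
print(talker(mid_state, codes).shape)                # (1, 10, 8, 2048)
```

The key design point this sketch illustrates is that the Talker never sees raw text: it consumes only the bridged hidden state and the code history, so speaker conditioning stays inside the audio-code context.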

In consistency evaluations, the dense and MoE variants achieve average CERs of 0.0897 and 0.0900, respectively, with overall voice-cloning similarities of 0.5995 and 0.5937. Beyond reporting a working system, the paper identifies three scale-critical design choices for small omni models: middle-layer semantic bridging, a released multimodal sequence format, and a parameter-efficient eight-codebook interface. All code, checkpoints, and training data are publicly available, including Parquet datasets for text-to-audio, image-to-text, and audio-to-audio training. This open release lets researchers inspect the complete interaction loop and build on it.
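
As a starting point for working with the release, the sketch below shows how one of the Parquet training shards could be inspected with pyarrow. The filename is a hypothetical placeholder; the real paths and column layout should be read from the published files.

```python
# A hedged sketch of inspecting a released Parquet training shard.
# "text_to_audio.parquet" is a hypothetical filename; check the actual
# release for real paths and column names.
import pyarrow.parquet as pq

table = pq.read_table("text_to_audio.parquet")
print(table.schema)     # discover the real column layout
df = table.to_pandas()  # requires pandas
print(df.head())
```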

Key Points
  • Open-source release includes model code, checkpoints, and Parquet training datasets for text-to-audio, image-to-text, and audio-to-audio.
  • Uses lightweight MLP projectors and an eight-layer Mimi-code buffer for efficient speech generation without a separate TTS module.
  • Achieves competitive voice-cloning similarity (0.5995) and a low character error rate (CER) of 0.0897 despite only 0.1B parameters; the CER metric is sketched below.
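
For reference, CER is the character-level Levenshtein edit distance between the hypothesis transcript and the reference, normalized by reference length. A minimal self-contained implementation (not the paper's evaluation code) looks like this:

```python
# Character error rate: edit distance over characters / reference length.
def cer(ref: str, hyp: str) -> float:
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))      # dp[j] = distance(ref[:0], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i   # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]          # dp[i-1][j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("hello world", "helo world"))  # 1 edit / 11 chars ≈ 0.0909
```

An average CER of 0.0897 therefore corresponds to roughly one character error per eleven reference characters.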

Why It Matters

Democratizes full-stack, speech-native multimodal AI, letting researchers inspect and customize voice, text, and vision models at low cost.