Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
A new framework adds speech understanding and generation to vision-language models such as LLaVA with minimal data and compute.
A team of researchers has published a paper on Speech-Omni-Lite, a framework designed to efficiently equip existing Vision-Language (VL) models with speech understanding and generation. The core innovation is a lightweight, plug-and-play architecture: instead of retraining a massive multimodal model from scratch, Speech-Omni-Lite attaches two small, trainable modules, a speech projector and a speech token generator, to a frozen VL backbone such as LLaVA. This preserves the backbone's original visual and textual reasoning while adding a speech modality at a fraction of the usual computational cost.
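The summary does not spell out the module interfaces, but the plug-and-play idea can be pictured roughly as follows. The PyTorch sketch below shows one way the two trainable modules might sit around a frozen backbone; the class names, dimensions, and optimizer setup are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Maps speech-encoder features into the VL backbone's embedding space."""
    def __init__(self, speech_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):       # (batch, time, speech_dim)
        return self.proj(speech_feats)     # (batch, time, llm_dim)

class SpeechTokenGenerator(nn.Module):
    """Predicts discrete speech tokens (e.g., codec units) from LLM hidden states."""
    def __init__(self, llm_dim=4096, num_speech_tokens=4096):
        super().__init__()
        self.head = nn.Linear(llm_dim, num_speech_tokens)

    def forward(self, hidden_states):      # (batch, seq, llm_dim)
        return self.head(hidden_states)    # logits over speech tokens

def attach_speech_modules(vl_backbone: nn.Module,
                          projector: SpeechProjector,
                          token_generator: SpeechTokenGenerator):
    """Freeze the VL backbone; only the two plug-in modules receive gradients."""
    for p in vl_backbone.parameters():
        p.requires_grad = False
    trainable = list(projector.parameters()) + list(token_generator.parameters())
    return torch.optim.AdamW(trainable, lr=1e-4)
```

Because only the projector and token generator are optimized, the training footprint stays small and the backbone's existing vision-language behavior is untouched.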
A key challenge in training such systems is the scarcity of high-quality spoken question-answer datasets. The researchers devised a low-cost data construction strategy that generates synthetic Question-Text Answer-Text-Speech (QTATS) data from readily available Automatic Speech Recognition (ASR) speech-text pairs. Experimental results show that with only thousands of hours of this synthesized speech data, Speech-Omni-Lite achieves spoken QA performance on par with much larger 'omni-models' trained on millions of hours. The learned speech modules also transfer well: once trained, they can be reused across different VL backbones, making the technology more portable for downstream applications.
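The exact composition of a QTATS tuple is not detailed here, but one plausible reading is that the ASR audio is reused as the spoken question, its transcript serves as the question text, and an off-the-shelf text LLM supplies the answer text. The Python sketch below follows that assumption; the field names and the generate_answer placeholder are hypothetical, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple

@dataclass
class QTATSExample:
    question_text: str      # ASR transcript, treated as the question text
    answer_text: str        # answer text produced by a text-only LLM
    question_speech: bytes  # original ASR audio, reused as the spoken question

def generate_answer(question: str) -> str:
    """Placeholder: prompt any off-the-shelf text LLM to answer `question`."""
    raise NotImplementedError

def build_qtats(asr_pairs: Iterable[Tuple[bytes, str]]) -> Iterator[QTATSExample]:
    """asr_pairs yields (audio_bytes, transcript) from an ordinary ASR corpus."""
    for audio, transcript in asr_pairs:
        yield QTATSExample(
            question_text=transcript,
            answer_text=generate_answer(transcript),
            question_speech=audio,
        )
```

The appeal of a recipe like this is that it needs no human annotation: existing ASR corpora plus a text LLM are enough to produce spoken-QA training tuples at scale.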
- Uses two lightweight plug-in modules to add speech to frozen vision-language models, drastically reducing compute needs.
- Generates synthetic training data (QTATS) from ASR pairs, requiring only thousands of hours of speech data for effective training.
- Achieves spoken QA performance comparable to models trained on millions of hours, with modules transferable across different model backbones.
Why It Matters
Enables affordable, portable voice interfaces for multimodal AI, making advanced assistants and interactive applications more accessible to developers.