DPO slashes code-switching errors by 89.6% in Audio LLMs
Training on 100K preference pairs fixes three failure modes in bilingual speech
A team of researchers from Singapore and Vietnam (Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw) published a paper on arXiv showing that Direct Preference Optimization (DPO) can dramatically improve Audio LLMs' ability to transcribe code-switching speech between English and Mandarin. They identified three systematic failure modes: language omission (dropping one language), translation-instead-of-transcription (rendering mixed speech as a single language), and hallucination (inventing content).
To fix these, they constructed 100,000 preference pairs (570 hours of audio) in which the chosen responses preserved the mixed-language composition and the rejected responses mimicked the failure patterns. Training three different Audio LLMs with DPO produced consistent behavioral shifts: the models learned to preserve language composition rather than translate when asked to transcribe. The method reduced mixed error rate (MER) by up to 89.6% on in-distribution data and by up to 20.0% on out-of-distribution data, suggesting DPO is a scalable way to elicit correct code-switching behavior from multilingual Audio LLMs.
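For readers unfamiliar with DPO, the per-pair objective can be sketched as below. This is a minimal illustration of the standard DPO loss, not the authors' training code: the function name, the `beta` value, and the example log-probabilities are all assumptions made for illustration. The inputs stand for the summed log-probabilities that the model being trained (the policy) and a frozen reference model assign to the chosen and rejected transcripts.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical numbers: the policy already prefers the mixed-language
# transcript (chosen) over the translated one (rejected).
loss_good = dpo_loss(-12.0, -30.0, -15.0, -20.0)
# Hypothetical numbers: the policy prefers the translated transcript.
loss_bad = dpo_loss(-30.0, -12.0, -20.0, -15.0)
```

The loss is lower when the policy raises the chosen transcript's likelihood relative to the reference more than it raises the rejected one's, which is the pressure that pushes a model toward preserving both languages.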
- Trained three Audio LLMs on 100K preference pairs (570 hours) using DPO
- Identified three failure modes: language omission, translation-instead-of-transcription, and hallucination
- Achieved MER reductions up to 89.6% in-distribution and 20.0% out-of-distribution
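To make the MER figures more concrete, here is a rough sketch of how a mixed error rate is commonly computed for English-Mandarin speech: Mandarin is scored per character, English per word, and the edit distance is divided by the reference length. The tokenization rule, the function names, and the example sentences are assumptions for illustration; the paper's exact scoring setup may differ.

```python
import re


def mixed_tokens(text):
    # Assumed tokenization: each CJK character is one token,
    # each run of Latin letters/digits is one token.
    return re.findall(r"[\u4e00-\u9fff]|[a-z0-9']+", text.lower())


def edit_distance(ref, hyp):
    # Classic Levenshtein distance over token lists.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mer(reference, hypothesis):
    ref = mixed_tokens(reference)
    if not ref:
        raise ValueError("reference has no scorable tokens")
    return edit_distance(ref, mixed_tokens(hypothesis)) / len(ref)


# Hypothetical example of the translation-instead-of-transcription failure:
reference = "我想 book 一个 meeting"
translated = "我想预订一个会议"
score = mer(reference, translated)
```

In this made-up example the translated output keeps the meaning but replaces both English words, so it is penalized heavily even though a human might call it "correct", which is why translation shows up as a transcription error.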
Why It Matters
More accurate transcription of code-switched speech is crucial for multilingual assistants, call centers, and voice interfaces in English-Mandarin markets.