Research & Papers

DPO slashes code-switching errors by up to 89.6% in Audio LLMs

Training on 100K preference pairs targets three failure modes in bilingual speech transcription

Deep Dive

A team of researchers from Singapore and Vietnam (Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw) published a paper on arXiv showing that Direct Preference Optimization (DPO) can dramatically improve Audio LLMs' ability to transcribe code-switching speech between English and Mandarin. They identified three systematic failure modes: language omission (dropping one language), translation-instead-of-transcription (rendering mixed speech as a single language), and hallucination (inventing content).

To address these failure modes, they constructed 100,000 preference pairs (570 hours of audio) in which the chosen responses preserved the mixed-language composition and the rejected responses mimicked the failure patterns. Training three different Audio LLMs with DPO produced a consistent behavioral shift: the models learned to preserve language composition rather than translate when asked to transcribe. The method reduced mixed error rate (MER) by up to 89.6% on in-distribution data and 20.0% on out-of-distribution data, suggesting that DPO is a scalable way to elicit correct code-switching behavior from multilingual Audio LLMs.
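For readers unfamiliar with DPO, the per-pair objective can be sketched in a few lines of Python. This is a minimal illustration of the standard DPO loss, not the authors' training code: the example pair, the log-probability values, and beta=0.1 are all assumptions made up for illustration.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * margin).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m).
    return softplus(-margin)

# Hypothetical preference pair: the rejected response translates the
# utterance into English instead of transcribing the mixed speech.
pair = {
    "chosen": "我今天要去 meeting 然后 submit report",
    "rejected": "I am going to a meeting today and then submitting the report",
}

# Made-up log-probabilities, for illustration only.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy == reference
loss_after = dpo_loss(-8.0, -15.0, -10.0, -12.0)      # policy prefers chosen
print(loss_at_start, loss_after)
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the chosen transcript than to the rejected one, which is the pressure that pushes a model away from translating or dropping a language.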

Key Points
  • Trained three Audio LLMs on 100K preference pairs (570 hours) using DPO
  • Identified three failure modes: language omission, translation-instead-of-transcription, and hallucination
  • Achieved MER reductions up to 89.6% in-distribution and 20.0% out-of-distribution
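
As a rough guide to how the MER figures could be computed, here is a minimal sketch. It assumes the common code-switching convention of scoring Mandarin at the character level and English at the word level, and it assumes the reported reductions are relative. The paper's exact text normalization may differ, and the example strings and scores below are invented.

```python
import re

def mixed_tokens(text):
    # Assumed convention: each CJK character is one token,
    # each run of Latin letters/digits/apostrophes is one token.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9']+", text.lower())

def edit_distance(ref, hyp):
    # Levenshtein distance over token lists (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def mer(reference, hypothesis):
    # Token-level errors divided by the number of reference tokens.
    ref = mixed_tokens(reference)
    hyp = mixed_tokens(hypothesis)
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(before, after):
    # Fraction of the baseline error rate removed by training.
    return (before - after) / before

reference = "我今天要去 meeting"
print(mer(reference, "我今天要去"))      # language omission: English word dropped
print(mer(reference, "我今天要去开会"))  # translation instead of transcription
print(relative_reduction(0.40, 0.30))   # invented scores, not from the paper
```

Under this convention, both failure modes in the example show up as token-level errors against the mixed-language reference, so a model that preserves the original language composition scores lower.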

Why It Matters

Accurate transcription of code-switched speech is crucial for multilingual assistants, call centers, and voice interfaces in English-Mandarin markets.