DPO cuts code-switching errors by up to 89.6% in Audio LLMs

arXiv cs.CL May 26, 2026

⚡ Training on 100K preference pairs targets three failure modes in bilingual speech transcription

Deep Dive

A team of researchers from Singapore and Vietnam (Trung Nguyen Quang, Cheng Yi Lewis Won, Minh Duc Pham, Yingxu He, Shuo Sun, Ai Ti Aw) published a paper on arXiv showing that Direct Preference Optimization (DPO) can substantially improve how well Audio LLMs transcribe speech that code-switches between English and Mandarin. They identified three systematic failure modes: language omission (dropping the words spoken in one of the two languages), translation-instead-of-transcription (translating the mixed speech into a single language instead of writing down what was said), and hallucination (inventing content that is not in the audio).
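As a rough illustration of how the first two failure modes could be flagged automatically, the sketch below checks which scripts appear in a reference transcript and in a model's output. This is a hypothetical heuristic, not the authors' method: the function names are invented for this example, and a missing language only tells you that omission or translation happened, not which of the two.

```python
import re

# Assumption: Mandarin is written in CJK Unified Ideographs and English in
# Latin letters. Real transcripts may need a broader character range.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def languages_present(text: str) -> set[str]:
    """Return the set of language tags ("zh", "en") whose script occurs in text."""
    langs = set()
    if CJK.search(text):
        langs.add("zh")
    if LATIN.search(text):
        langs.add("en")
    return langs


def dropped_languages(reference: str, hypothesis: str) -> set[str]:
    """Languages that appear in the reference but are absent from the hypothesis."""
    return languages_present(reference) - languages_present(hypothesis)


# A code-switched reference and an output that rendered everything in English.
reference = "我今天要去 meeting"
hypothesis = "I am going to a meeting today"
print(dropped_languages(reference, hypothesis))
```

An empty result means both languages survived in the output. It says nothing about hallucination, which needs a comparison against the audio or the reference content.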

To address these failures, they constructed 100,000 preference pairs (570 hours of audio) in which the chosen response preserved the mixed-language composition of the speech and the rejected response mimicked one of the failure patterns. Training three different Audio LLMs with DPO produced consistent behavioral shifts: the models learned to preserve the language composition rather than translate when asked to transcribe. The method reduced mixed error rate (MER) by up to 89.6% on in-distribution data and by up to 20.0% on out-of-distribution data. The authors take this as evidence that DPO is a scalable way to elicit correct code-switching behavior from multilingual Audio LLMs, although the much smaller out-of-distribution gain shows that the improvement only partly transfers to unseen data.
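For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The example pair, the log-probability values, and beta = 0.1 are assumptions chosen for illustration. The paper's actual training setup and hyperparameters are not given in this summary.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability a model assigns to a response.
    "policy" is the model being trained; "ref" is the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical pair: the chosen transcript keeps both languages, while the
# rejected one translates the utterance into English.
pair = {
    "chosen": "我今天要去 meeting",
    "rejected": "I am going to a meeting today",
}

# Before training, policy and reference agree, so the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Once the policy favors the chosen transcript, the loss drops.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss falls as the policy raises the likelihood of the chosen transcript relative to the rejected one, measured against the reference model. That is the mechanism by which pairs like this push a model away from translating when asked to transcribe.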

Key Points

Trained three Audio LLMs on 100K preference pairs (570 hours) using DPO
Identified three failure modes: language omission, translation-instead-of-transcription, and hallucination
Achieved MER reductions up to 89.6% in-distribution and 20.0% out-of-distribution
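To make the MER figures concrete, here is a minimal sketch of how mixed error rate and a relative reduction can be computed. It assumes the common convention for English-Mandarin code-switching of scoring Mandarin per character and English per word; the paper's exact tokenization and text normalization may differ. The before and after values in the last line are made-up numbers that only show the arithmetic.

```python
import re

# Assumption: one token per CJK character, one token per English word.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text.lower())


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance divided by the number of reference tokens."""
    ref = tokenize(reference)
    return edit_distance(ref, tokenize(hypothesis)) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate fell, e.g. 0.896 for an 89.6% reduction."""
    return (before - after) / before


# The English word "meeting" was translated into Mandarin: 2 edits over 6 tokens.
print(mixed_error_rate("我今天要去 meeting", "我今天要去开会"))
# Illustrative values only, not results from the paper.
print(relative_reduction(0.40, 0.10))
```

Note that the reported 89.6% and 20.0% are relative reductions of this kind, not absolute drops in MER.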

Why It Matters

More accurate transcription of code-switched speech matters for multilingual assistants, call centers, and voice interfaces in English-Mandarin markets, where speakers routinely mix both languages in a single utterance.
