MERVIN: New multimodal framework tackles event retrieval in Vietnamese news videos
Scored 79/88 in the AI Challenge HCMC 2025 qualification phase and retrieved every relevant result in the final round
MERVIN addresses the challenge of semantically grounded event retrieval in Vietnamese news videos, where noise from accents, background sounds, and recognition errors often degrades transcript quality. The framework uses Gemini 1.5 Flash to enhance transcripts, a Perception Encoder for visual features, and a Vietnamese language model for text embeddings. All features are indexed in Milvus, enabling efficient similarity-based cross-modal retrieval. A React-based UI lets users refine queries iteratively across modalities (keyframes, transcripts, summaries), improving semantic alignment.
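To make the similarity-based retrieval step concrete, here is a minimal sketch of how a text query embedding can be matched against indexed keyframe embeddings by cosine similarity. It is an illustration under stated assumptions, not MERVIN's implementation: the `InMemoryIndex` class is a hypothetical NumPy stand-in for a vector database collection and does not use the Milvus API, and the 4-dimensional vectors and frame IDs are made up.

```python
from __future__ import annotations

import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


class InMemoryIndex:
    """Toy stand-in for a vector database collection (hypothetical, not the Milvus API)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.ids: list[str] = []
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, ids: list[str], vectors: np.ndarray) -> None:
        vectors = np.asarray(vectors, dtype=np.float32)
        if vectors.shape != (len(ids), self.dim):
            raise ValueError("expected one vector of length dim per id")
        self.ids.extend(ids)
        # Store unit-length vectors so search reduces to a matrix-vector product.
        self.vectors = np.vstack([self.vectors, normalize(vectors)])

    def search(self, query: np.ndarray, top_k: int = 3) -> list[tuple[str, float]]:
        q = normalize(np.asarray(query, dtype=np.float32).reshape(1, -1))[0]
        scores = self.vectors @ q          # cosine similarity to every stored vector
        order = np.argsort(-scores)[:top_k]  # highest similarity first
        return [(self.ids[i], float(scores[i])) for i in order]


# Made-up 4-dimensional embeddings; real encoders produce much larger vectors.
keyframes = InMemoryIndex(dim=4)
keyframes.add(
    ["video1_frame010", "video1_frame250", "video2_frame035"],
    np.array([
        [0.9, 0.1, 0.0, 0.0],
        [0.0, 1.0, 0.2, 0.0],
        [0.1, 0.0, 0.9, 0.3],
    ]),
)

# A text query assumed to be embedded into the same space as the keyframes.
query_embedding = np.array([0.8, 0.2, 0.0, 0.0])
results = keyframes.search(query_embedding, top_k=2)
print(results)
```

A dedicated vector database performs the same nearest-neighbor lookup with approximate indexes, so it stays fast as the number of keyframes and transcript segments grows.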
Tested on Vietnamese news videos, MERVIN scored 79 out of 88 points (about 90%) in the qualification phase of AI Challenge HCMC 2025 and retrieved every relevant result for all queries in the final round. The paper has been accepted to SOICT 2025. The work shows that a multimodal retrieval system tailored to a specific language and to noisy input can perform well, and it offers a blueprint for video retrieval in other low-resource languages.
- Integrates keyframes, Gemini-1.5-Flash-enhanced transcripts, and video summaries for multimodal retrieval
- Indexes visual features (Perception Encoder) and Vietnamese language embeddings in Milvus for fast similarity search
- Scored 79/88 in the AI Challenge HCMC 2025 qualification phase and retrieved every relevant result in the final round
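The summary does not say how MERVIN combines results across keyframes, transcripts, and summaries, or whether it merges them automatically at all. As one common option, swapped in here purely for illustration, reciprocal rank fusion (RRF) merges per-modality ranked lists into a single ranking. The function name, the clip IDs, and the constant `k = 60` (a conventional default for RRF) are assumptions, not details from the paper.

```python
from __future__ import annotations

from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge ranked lists: each item earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-modality results for one query, best match first.
rankings = {
    "keyframes": ["clip_a", "clip_b", "clip_c"],
    "transcripts": ["clip_b", "clip_a", "clip_d"],
    "summaries": ["clip_b", "clip_c", "clip_a"],
}

fused = reciprocal_rank_fusion(rankings)
for clip_id, score in fused:
    print(f"{clip_id}: {score:.4f}")
```

RRF uses only rank positions, so it avoids having to calibrate similarity scores that come from different encoders and are not directly comparable.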
Why It Matters
MERVIN shows how to build robust video retrieval for languages where automatic speech recognition (ASR) output is noisy, advancing media search in Vietnamese.