MERVIN: New multimodal framework beats Vietnamese news video retrieval challenge

arXiv cs.IR May 18, 2026

Scored 79/88 in the AI Challenge HCMC 2025 qualification round and retrieved all results in the finals

Deep Dive

MERVIN addresses the challenge of semantically grounded event retrieval in Vietnamese news videos, where noise from accents, background sounds, and recognition errors often degrades transcript quality. The framework uses Gemini 1.5 Flash to enhance transcripts, a Perception Encoder for visual features, and a Vietnamese language model for text embeddings. All features are indexed in Milvus, enabling efficient similarity-based cross-modal retrieval. A React-based UI lets users refine queries iteratively across modalities (keyframes, transcripts, summaries), improving semantic alignment.
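The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: a plain in-memory dictionary stands in for Milvus, the three-dimensional vectors are made-up toy embeddings, and the weighted-sum fusion across modalities is an assumption chosen for illustration (the summary does not say how MERVIN combines scores). The names `INDEX`, `cosine`, and `search` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy index: one list of (video_id, embedding) pairs per modality.
# In the described system these vectors would live in Milvus collections.
INDEX = {
    "keyframe": [("video_a", [0.9, 0.1, 0.0]), ("video_b", [0.1, 0.9, 0.0])],
    "transcript": [("video_a", [0.8, 0.2, 0.1]), ("video_b", [0.0, 0.7, 0.7])],
    "summary": [("video_a", [0.7, 0.3, 0.0]), ("video_b", [0.2, 0.8, 0.1])],
}


def search(query_vecs, index=INDEX, weights=None, top_k=5):
    """Rank videos by a weighted sum of per-modality cosine similarities.

    query_vecs maps a modality name to the query embedding for that modality.
    weights maps a modality name to its fusion weight (assumed, default 1.0 each).
    """
    if weights is None:
        weights = {modality: 1.0 for modality in query_vecs}
    fused = defaultdict(float)
    for modality, query in query_vecs.items():
        weight = weights.get(modality, 0.0)  # modalities without a weight are ignored
        for video_id, embedding in index.get(modality, []):
            fused[video_id] += weight * cosine(query, embedding)
    ranked = sorted(fused.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Query both keyframes and transcripts, trusting keyframes slightly more.
    results = search(
        {"keyframe": [1.0, 0.0, 0.0], "transcript": [1.0, 0.0, 0.0]},
        weights={"keyframe": 0.6, "transcript": 0.4},
    )
    for video_id, score in results:
        print(video_id, round(score, 3))
```

Changing the `weights` argument between calls loosely mirrors the iterative refinement the UI is said to support: a user who finds transcripts unreliable for a query could shift weight toward keyframes or summaries and search again.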

Tested on Vietnamese news videos, MERVIN achieved 79 out of 88 points in the qualification phase of AI Challenge HCMC 2025 and successfully retrieved every relevant result for all queries in the final round. The paper has been accepted to SOICT 2025. This work demonstrates the growing power of multimodal retrieval systems tailored for specific languages and noisy input conditions, offering a blueprint for other low-resource language video retrieval tasks.

Key Points

Integrates keyframes, Gemini-1.5-Flash-enhanced transcripts, and video summaries for multimodal retrieval
Indexes visual features (Perception Encoder) and Vietnamese language embeddings in Milvus for fast similarity search
Scored 79/88 in AI Challenge HCMC 2025 qualification and achieved perfect retrieval in final round

Why It Matters

MERVIN shows how to build robust video retrieval for languages with noisy ASR, advancing media search in Vietnamese.
