Interactive Multi-Turn Retrieval for Health Videos
Single-turn search fails for medical videos; multi-turn queries improve precision by 40%
Health video retrieval systems have long been single-turn: users submit one query and get back one static list. That is brittle in clinical contexts, where information needs evolve. A trainee might start with 'how to draw blood' and then refine it with 'left arm, tourniquet above elbow, no anticoagulants.' Existing single-turn systems cannot take such follow-up constraints into account.
A team led by Chengzheng Wu built MHVRC (Multi-Turn Health Video Retrieval Corpus) by pairing video-grounded descriptions generated by VideoChat-Flash with query refinements generated by DeepSeek. They also propose DATR (Dialogue-Aware Two-Stage Retrieval), which first performs coarse retrieval with a CLIP-style dual encoder over sparsely sampled frames, then re-ranks the top candidates using multi-turn query fusion and a lightweight cross-encoder scoring module.
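To make the two-stage control flow concrete, here is a minimal Python sketch under loudly stated assumptions. It is not the paper's DATR implementation. The hashed bag-of-words `embed_text` stands in for a CLIP-style encoder, per-frame captions stand in for visual frame features, and the recency-weighted `fuse_turns` rule, the Jaccard-overlap `cross_encoder_score`, the `alpha` mixing weight, and all function names are hypothetical placeholders. What it does show is the structure: sparse frame sampling plus a cheap dual-encoder pass over the whole corpus, followed by a costlier joint score computed only for the shortlisted candidates.

```python
import math
import zlib
from dataclasses import dataclass

DIM = 64  # toy embedding width (assumption); real CLIP-style encoders are much wider


def embed_text(text: str) -> list[float]:
    """Hypothetical stand-in for a CLIP-style text encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode()) % DIM] += 1.0  # crc32 is deterministic across runs
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Video:
    vid: str
    frame_captions: list[str]  # placeholder for per-frame visual features (must be non-empty)
    description: str           # placeholder for a model-generated video description


def sample_frames(n_frames: int, k: int) -> list[int]:
    """Sparse sampling: k evenly spaced frame indices (all frames if the clip is short)."""
    if n_frames <= k:
        return list(range(n_frames))
    step = n_frames / k
    return [int(i * step + step / 2) for i in range(k)]


def video_embedding(video: Video, k: int = 4) -> list[float]:
    """Mean-pool the embeddings of the sampled frames only."""
    idx = sample_frames(len(video.frame_captions), k)
    frames = [embed_text(video.frame_captions[i]) for i in idx]
    return [sum(col) / len(frames) for col in zip(*frames)]


def fuse_turns(turns: list[str], decay: float = 0.7) -> list[float]:
    """Assumed fusion rule: recency-weighted average of per-turn embeddings (latest turn weighs most)."""
    weights = [decay ** (len(turns) - 1 - i) for i in range(len(turns))]
    embs = [embed_text(t) for t in turns]
    total = sum(weights)
    return [sum(w * e[d] for w, e in zip(weights, embs)) / total for d in range(DIM)]


def cross_encoder_score(turns: list[str], video: Video) -> float:
    """Toy joint scorer that sees dialogue and description together: Jaccard token overlap."""
    q = set(" ".join(turns).lower().split())
    d = set(video.description.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def two_stage_retrieve(turns: list[str], corpus: list[Video], top_n: int = 3, alpha: float = 0.5) -> list[str]:
    # Stage 1 (coarse): one dual-encoder pass of the whole dialogue against every pooled video vector.
    vid_embs = {v.vid: video_embedding(v) for v in corpus}
    q_emb = embed_text(" ".join(turns))
    shortlist = sorted(corpus, key=lambda v: cosine(q_emb, vid_embs[v.vid]), reverse=True)[:top_n]
    # Stage 2 (re-rank): fused multi-turn query plus the joint scorer, applied to the shortlist only.
    fused = fuse_turns(turns)

    def rerank(v: Video) -> float:
        return alpha * cosine(fused, vid_embs[v.vid]) + (1 - alpha) * cross_encoder_score(turns, v)

    return [v.vid for v in sorted(shortlist, key=rerank, reverse=True)]


corpus = [
    Video("v1", ["apply tourniquet above elbow", "clean left arm", "insert needle left arm", "draw blood into tube"],
          "draw blood from left arm with tourniquet above elbow"),
    Video("v2", ["knee rehabilitation exercise", "stretch leg", "bend knee", "rest"],
          "knee rehabilitation exercises after surgery"),
    Video("v3", ["tourniquet on wrist", "clean right hand", "insert needle right hand", "draw blood into tube"],
          "draw blood from right hand vein"),
]
turns = ["how to draw blood", "left arm tourniquet above elbow"]
print(two_stage_retrieve(turns, corpus))
```

The design point is cost: with a dual encoder, video vectors can be precomputed once, so stage 1 reduces to a similarity lookup, while the joint scorer, which must see query and candidate together, runs only on `top_n` items. Calling `two_stage_retrieve(turns, corpus, alpha=0.0)` makes the re-rank depend only on the token-overlap scorer, which orders the toy corpus as `['v1', 'v3', 'v2']`, placing the left-arm, above-elbow clip first.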
Experiments on MHVRC show consistent gains over strong text-video retrieval baselines. User studies confirm that multi-turn queries better capture fine-grained procedural semantics, which are key for tasks such as patient rehabilitation or surgical training. The paper contributes both a benchmark (MHVRC) and a scalable technical recipe for interactive health video retrieval. The work is still at the research stage, but the approach could be integrated into medical education platforms (e.g., for resident training) or hospital patient education tools, where iterative refinement mirrors how clinical questions are actually asked. The paper is available on arXiv (2605.01409).
- MHVRC dataset combines video-grounded descriptions from VideoChat-Flash with query refinements from DeepSeek for multi-turn health queries.
- DATR uses a CLIP-style dual encoder for coarse retrieval, then a cross-encoder for re-ranking after fusing multiple query turns.
- User studies show refined multi-turn queries better capture procedural semantics than single-turn annotations.
Why It Matters
Enables precise, context-aware search for medical training and patient education, matching real clinical workflows.