Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
Language-guided cues help MLLMs ground occluded and small objects in crowded scenes...
Researchers propose Language-Guided Semantic Cues (LGSCs) to improve Multimodal Large Language Model (MLLM) grounding in crowded scenes. Their Semantic Cue Extractor (SCE) first derives object-level semantic cues from the MLLM's visual pipeline, then guides these cues with the corresponding text embeddings to produce LGSCs, which serve as linguistic semantic priors. The LGSCs are reintegrated into the original visual pipeline to refine object semantics. Experiments demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
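The paper's implementation is not reproduced here, so the following is a minimal PyTorch sketch of how such a module could look. The class name, dimensions, the use of cross-attention for language guidance, and the residual reintegration are all assumptions for illustration, not the authors' design.

```python
import torch
import torch.nn as nn

class SemanticCueExtractor(nn.Module):
    """Hypothetical sketch of an SCE-like module: visual cues cross-attend
    to text embeddings to produce language-guided semantic cues (LGSCs)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Visual cues act as queries; text embeddings supply keys/values,
        # so language steers which visual semantics are emphasized.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N_patches, dim) from the MLLM's visual pipeline
        # text_embeds:  (B, N_tokens, dim) from the referring expression
        cues, _ = self.cross_attn(query=visual_feats, key=text_embeds, value=text_embeds)
        lgsc = self.proj(self.norm(cues))  # language-guided semantic cues
        # Reintegrate into the visual pipeline as a residual refinement
        # (assumed here; the paper may fuse differently).
        return visual_feats + lgsc


# Toy usage with random tensors standing in for real MLLM activations.
sce = SemanticCueExtractor(dim=256)
visual = torch.randn(2, 196, 256)  # e.g. 14x14 patch features
text = torch.randn(2, 12, 256)     # tokenized expression embeddings
refined = sce(visual, text)
print(refined.shape)               # torch.Size([2, 196, 256])
```

A residual connection is a natural choice here because it lets the language prior refine, rather than overwrite, the original visual semantics, which matches the paper's description of reintegrating LGSCs into the existing pipeline.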
- LGSCs use text embeddings to guide visual semantic cues, mitigating the challenges that occlusion and small objects pose for MLLMs.
- Semantic Cue Extractor (SCE) extracts visual cues and refines them with language, boosting grounding accuracy.
- Validated at ICASSP 2026; the method requires no full retraining and is practical for crowded-scene applications.
Why It Matters
Enables MLLMs to reliably ground objects in crowded scenes, a capability critical for autonomous driving and robotics.