Image & Video

VGGT-based online 3D semantic SLAM for indoor scene understanding and navigation

New system maps indoor scenes in real time using under 17GB of VRAM, regardless of video length.

Deep Dive

A research team led by Anna Gelencsér-Horváth developed SceneVGGT, a 3D semantic SLAM framework built on the VGGT architecture. It performs online 3D mapping from video streams using a sliding-window pipeline that maintains geometric consistency across windows. The system lifts 2D instance masks into 3D objects for coherent tracking, stays under 17GB of GPU memory, and achieves competitive performance on ScanNet++. It also enables interactive assistive navigation with audio feedback by projecting object locations onto the floor plane.
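The floor-plane projection step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a known floor plane (normal plus a point on the plane) and a known camera heading, projects an object's 3D centroid onto the floor, and computes a signed bearing angle of the kind an audio-feedback interface could announce.

```python
import numpy as np

def project_to_floor(obj_xyz, floor_normal, floor_point):
    # Drop the component of the object position along the floor normal,
    # leaving the point's orthogonal projection onto the floor plane.
    n = floor_normal / np.linalg.norm(floor_normal)
    dist = np.dot(obj_xyz - floor_point, n)
    return obj_xyz - dist * n

def bearing_deg(cam_xy, forward_xy, target_xy):
    # Signed angle (degrees, in [-180, 180]) from the camera's forward
    # direction to the target, measured in the floor plane.
    v = target_xy - cam_xy
    ang = np.degrees(np.arctan2(v[1], v[0])
                     - np.arctan2(forward_xy[1], forward_xy[0]))
    return (ang + 180.0) % 360.0 - 180.0

# Hypothetical example: a chair centroid at (1, 2, 3) with the floor at y=0.
chair_floor = project_to_floor(np.array([1.0, 2.0, 3.0]),
                               np.array([0.0, 1.0, 0.0]),
                               np.zeros(3))  # -> [1., 0., 3.]
```

A navigation aid could then map the bearing to spatialized audio (e.g. "chair, 90 degrees left"), which is the kind of interaction the paper describes.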

Why It Matters

Enables real-time navigation aids for visually impaired users and more capable autonomous robots in complex indoor environments.