Research & Papers

MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry

New training-free method lets neural geometry models process massive image collections without hitting memory walls.

Deep Dive

A research team led by Leo Kaixuan Cheng and Ruofan Liang has unveiled MERG3R, a framework designed to overcome a fundamental bottleneck in neural 3D reconstruction. Transformer-based models like VGGT and Pi3 achieve high accuracy, but their full attention mechanisms make memory grow quadratically with the number of input views, so they cannot process large, unordered image collections (like thousands of tourist photos of a landmark) on standard GPUs. MERG3R addresses this not by retraining models, but with a training-free preprocessing and merging pipeline that lets these existing 'geometric foundation models' operate far beyond their native limits.

The framework works in three core stages. It first reorders and partitions a massive set of input images into smaller, overlapping subsets that are geometrically diverse yet fit within memory. Each subset is then fed independently into an existing neural reconstruction model. The final stage, MERG3R's key contribution, merges the resulting local 3D reconstructions through an efficient global alignment followed by a confidence-weighted bundle adjustment, producing a single, consistent large-scale model. Tested on major benchmarks including 7-Scenes and Tanks & Temples, MERG3R consistently improved reconstruction accuracy while reducing memory requirements. This scalability paves the way for detailed 3D models built from internet-scale photo collections, with significant implications for mapping, heritage preservation, and augmented reality.
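The paper's actual merging procedure is more involved, but the overall divide-and-conquer shape can be sketched with two hypothetical helpers: overlapping partitioning (stage 1) and a least-squares Umeyama-style similarity alignment between two local reconstructions that share images (a stand-in for stage 3's global alignment). All function names and parameters here are illustrative, not MERG3R's API.

```python
import numpy as np

def partition_with_overlap(n_images, chunk_size, overlap):
    """Stage 1 (illustrative): split image indices into overlapping subsets
    small enough for the base model's memory budget."""
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < n_images:
        end = min(start + chunk_size, n_images)
        chunks.append(list(range(start, end)))
        if end == n_images:
            break
        start += step
    return chunks

def align_reconstructions(src_pts, dst_pts):
    """Stage 3 (illustrative): least-squares similarity transform
    (scale, rotation, translation) mapping points shared by two local
    reconstructions onto each other. Umeyama-style stand-in for MERG3R's
    global alignment; the paper also adds confidence-weighted bundle
    adjustment on top."""
    mu_s, mu_d = src_pts.mean(0), dst_pts.mean(0)
    src, dst = src_pts - mu_s, dst_pts - mu_d
    U, S, Vt = np.linalg.svd(dst.T @ src)
    sign = np.ones(3)
    if np.linalg.det(U @ Vt) < 0:       # guard against reflections
        sign[-1] = -1.0
    R = U @ np.diag(sign) @ Vt
    scale = (S * sign).sum() / (src ** 2).sum()
    t = mu_d - scale * R @ mu_s
    return scale, R, t

# Overlapping chunks: each neighbouring pair shares `overlap` images,
# giving the merger common points to align on.
print(partition_with_overlap(10, chunk_size=4, overlap=2))
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]

# Recover a known similarity transform from points shared by two chunks.
rng = np.random.default_rng(0)
shared = rng.standard_normal((20, 3))
R_true = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
scale, R, t = align_reconstructions(shared, 2.0 * shared @ R_true.T + [1., 2., 3.])
```

The overlap between neighbouring subsets is what makes the merge well-posed: shared images yield shared 3D points, and those correspondences pin down the per-chunk similarity transforms.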

Key Points
  • Model-agnostic framework that works with existing neural geometry models like VGGT and Pi3 without retraining.
  • Uses a divide-and-conquer strategy to partition large image sets into manageable subsets, bypassing GPU memory constraints.
  • Demonstrated improved accuracy and scalability on large-scale datasets including Tanks & Temples and Cambridge Landmarks.
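The memory argument behind the second bullet is easy to quantify: a full pairwise attention score matrix grows quadratically with total token count, so running subsets of k images instead of all N at once shrinks that matrix by a factor of (N/k)². The token count and dtype below are assumed for illustration, and real implementations (e.g. FlashAttention-style kernels) avoid materializing the full matrix; this is only a back-of-envelope sketch of the scaling.

```python
def naive_attn_score_bytes(n_images, tokens_per_image=1024, bytes_per_el=2):
    """Bytes to materialize one naive full-attention score matrix over all
    tokens (assumed: 1024 patch tokens per image, fp16 scores)."""
    n_tokens = n_images * tokens_per_image
    return n_tokens * n_tokens * bytes_per_el

full_batch = naive_attn_score_bytes(2000)   # all 2000 photos jointly
per_chunk = naive_attn_score_bytes(200)     # one 200-image subset
print(f"full: {full_batch / 1e12:.1f} TB, chunk: {per_chunk / 1e9:.1f} GB, "
      f"ratio: {full_batch // per_chunk}x")
# -> full: 8.4 TB, chunk: 83.9 GB, ratio: 100x
```

The absolute numbers depend entirely on the assumed token count, but the 100x ratio between the whole collection and a 200-image subset follows directly from the quadratic growth, which is the scaling MERG3R's partitioning sidesteps.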

Why It Matters

Enables creation of high-quality 3D models from massive, unstructured photo collections for mapping, AR, and digital preservation.