Research & Papers

LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

Handles severe egomotion and occlusion with a novel 'lift-then-fit' transformer approach.

Deep Dive

Tracking multiple people in 3D from a moving VR/AR headset is notoriously hard due to severe egomotion, partial visibility, and occlusion. Existing monocular methods assume static or slowly moving cameras and fail on dynamic egocentric captures. The Meta team introduces LAMP (Localization Aware Multi-camera People Tracking), which tackles this by disentangling observer and target motion early in the pipeline. Their two-step process first leverages the headset's known 6-DoF motion and camera calibration to project detected 2D body keypoints from multiple temporally asynchronous cameras into a unified 3D world reference frame. An end-to-end trained spatio-temporal transformer then fits 3D human motion directly to this 3D ray cloud. This 'lift-then-fit' approach learns a natural human motion prior in world space and flexibly incorporates data from partially observing, moving cameras.
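The 'lift' step is essentially classical back-projection: given the camera intrinsics and the headset's 6-DoF pose, each detected 2D keypoint becomes a ray (origin plus direction) in the world frame. A minimal sketch of that geometry, with illustrative shapes and names not taken from the paper:

```python
import numpy as np

def lift_keypoints_to_rays(keypoints_px, K, T_world_cam):
    """Unproject 2D keypoints into world-frame rays (illustrative sketch).

    keypoints_px: (N, 2) pixel coordinates of detected body keypoints.
    K:            (3, 3) camera intrinsics.
    T_world_cam:  (4, 4) camera-to-world pose from headset 6-DoF tracking.
    Returns (origins, directions), each (N, 3), in world coordinates.
    """
    n = keypoints_px.shape[0]
    # Homogeneous pixel coordinates -> camera-frame ray directions.
    px_h = np.hstack([keypoints_px, np.ones((n, 1))])       # (N, 3)
    dirs_cam = px_h @ np.linalg.inv(K).T                    # (N, 3)
    dirs_cam /= np.linalg.norm(dirs_cam, axis=1, keepdims=True)
    # Rotate directions into the world frame; every ray starts
    # at the camera center given by the headset pose.
    R, t = T_world_cam[:3, :3], T_world_cam[:3, 3]
    dirs_world = dirs_cam @ R.T
    origins = np.tile(t, (n, 1))
    return origins, dirs_world
```

Because each ray is expressed in a single metric world frame, observations from asynchronous cameras at different timestamps can be pooled into one ray cloud without any cross-camera synchronization.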

LAMP achieves state-of-the-art results on standard monocular benchmarks while dramatically outperforming baselines in the targeted egocentric multi-camera setting. By modeling both observer and target motion explicitly, the system remains robust even under rapid head movements and heavy occlusions common in VR/AR use cases. The paper, accepted at CVPR 2026, demonstrates that combining device localization with learned motion priors unlocks reliable multi-person 3D tracking from commodity headsets. This capability is essential for immersive mixed reality experiences, social VR, and real-time avatar animation.

Key Points
  • Introduces a 'lift-then-fit' approach that first converts 2D keypoints from multiple cameras into a 3D world ray cloud using known 6-DoF headset motion
  • Uses an end-to-end spatio-temporal transformer to fit 3D human motion directly to the ray cloud, learning a world-space motion prior
  • Achieves SOTA on monocular benchmarks and significantly outperforms baselines in egocentric multi-camera settings
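LAMP's fit step is a learned transformer, but the geometric problem it generalizes is classical ray triangulation: find the 3D point closest, in a least-squares sense, to a bundle of world-space rays. A hedged sketch of that baseline (not the paper's method) shows why the ray-cloud representation makes fusing partial views natural:

```python
import numpy as np

def fit_point_to_rays(origins, directions):
    """Least-squares 3D point nearest a bundle of world-space rays.

    Minimizes sum_i || (I - d_i d_i^T)(x - o_i) ||^2 by solving
    the 3x3 normal equations A x = b. A classical stand-in for
    the learned fit step; needs at least two non-parallel rays.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        # Projector onto the plane perpendicular to the ray direction:
        # it measures the point's perpendicular distance to the ray.
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ o
    return np.linalg.solve(A, b)
```

Unlike this per-point solve, the learned motion prior lets LAMP produce plausible poses even when a joint is seen by only one camera, or briefly by none, by drawing on temporal context.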

Why It Matters

Enables robust multi-person 3D tracking in real-time VR/AR, unlocking social presence and avatar animation from headset cameras.