New self-supervised depth prediction technique improves accuracy in low-texture scenes

Distance transform over pre-semantic contours boosts depth accuracy in uniform regions across 5 benchmarks

Deep Dive

Self-supervised monocular depth estimation (MDE) struggles in low-texture regions, where neighboring pixels look nearly identical and photometric losses become ambiguous. A new paper by Marwane Hariat, Antoine Manzanera, and David Filliat tackles this by applying a distance transform over pre-semantic contours (edge maps extracted before semantic classification). The transform increases variance in uniform areas, which gives the loss functions a usable signal there. The network jointly learns pre-semantic contours, depth, and ego-motion, and the authors present a theoretical proof that the distance transform is optimal for variance augmentation.
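To make the variance idea concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a hand-drawn contour map (in the paper the contours are learned by the network), a brute-force Euclidean distance transform, and a crude stand-in for a photometric loss: the mean absolute difference between a map and a copy of itself shifted by one pixel. The names `distance_transform` and `shift_error` are hypothetical helpers introduced for illustration only.

```python
import math
from statistics import pvariance


def distance_transform(contours):
    """Brute-force Euclidean distance from each pixel to the nearest contour pixel.

    `contours` is a 2D list of 0/1 values (1 marks a contour pixel).
    Quadratic in the pixel count, so only suitable for tiny illustrative grids.
    """
    edge_pixels = [(r, c)
                   for r, row in enumerate(contours)
                   for c, value in enumerate(row) if value]
    if not edge_pixels:
        raise ValueError("contour map has no contour pixels")
    return [[min(math.hypot(r - er, c - ec) for er, ec in edge_pixels)
             for c in range(len(row))]
            for r, row in enumerate(contours)]


def shift_error(grid, shift):
    """Mean absolute difference between interior pixels and their neighbors
    `shift` columns to the left (a toy proxy for a photometric error
    evaluated at a wrong horizontal displacement)."""
    diffs = [abs(row[c] - row[c - shift])
             for row in grid[1:-1]
             for c in range(1 + shift, len(row) - 1)]
    return sum(diffs) / len(diffs)


# Toy 7x7 scene: a uniform "wall" whose only structure is its rectangular border.
size = 7
image = [[0.5] * size for _ in range(size)]  # constant intensity everywhere
contours = [[1 if r in (0, size - 1) or c in (0, size - 1) else 0
             for c in range(size)]
            for r in range(size)]

dist = distance_transform(contours)

# Variance inside the interior, i.e. the low-texture region.
interior = [(r, c) for r in range(1, size - 1) for c in range(1, size - 1)]
image_var = pvariance([image[r][c] for r, c in interior])
dist_var = pvariance([dist[r][c] for r, c in interior])

print(f"interior variance  raw image: {image_var:.2f}  distance map: {dist_var:.2f}")
print(f"1-pixel shift error raw image: {shift_error(image, 1):.2f}  "
      f"distance map: {shift_error(dist, 1):.2f}")
```

In this toy setup the raw image has zero variance inside the wall, so a one-pixel misalignment costs nothing and the loss cannot tell right from wrong. The distance map varies with position, so the same misalignment produces a nonzero error. How the paper actually feeds the transformed contours into its losses is not covered in this summary.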

Extensive experiments on five major datasets (KITTI, Cityscapes, Waymo, NYUv2, and ScanNet) show the method surpasses all compared self-supervised techniques. The approach is particularly robust on indoor scenes (NYUv2) and autonomous driving benchmarks (KITTI, Waymo), where low-texture surfaces like walls or roads previously caused errors. This work offers a practical, label-free solution for improving depth perception in real-world applications.

Key Points
  • Applies a distance transform to pre-semantic contours to boost depth prediction in low-texture areas
  • Jointly estimates contours, depth, and ego-motion in a single self-supervised framework
  • Outperforms competing self-supervised methods on KITTI, Cityscapes, Waymo, NYUv2, and ScanNet

Why It Matters

Enables more reliable depth estimation for autonomous driving, robotics, and AR without needing expensive labeled data.