CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation
A new training framework improves BEV semantic segmentation by up to 4.86 mIoU without adding inference cost.
A research team from Korea University and Hyundai Motor Group has introduced CycleBEV, a novel regularization framework that significantly enhances the performance of Bird's-Eye-View (BEV) semantic segmentation models for autonomous vehicles. The core challenge in this domain is the ambiguous transformation of 2D camera images from a perspective view (PV) into a coherent top-down map of the 3D scene, a process plagued by depth uncertainty and occlusion. CycleBEV tackles this by applying a 'cycle consistency' principle, inspired by techniques from image-to-image translation, to regularize existing View Transformation (VT) networks during training. This forces the model to learn more accurate geometric and semantic relationships by ensuring that transforming from PV to BEV and back again reproduces the original perspective view.
The technical innovation lies in an auxiliary Inverse View Transformation (IVT) network that is active only during training. The IVT network maps the generated BEV segmentation back to the original PV space, closing the loop. The framework then applies cycle consistency losses in both the geometric space and a novel 'representation' space to better exploit the IVT's capacity. When applied to four representative VT models spanning different paradigms and evaluated on the large-scale nuScenes dataset, CycleBEV delivered consistent and substantial improvements. Most notably, it boosted vehicle segmentation by 4.86 mIoU and pedestrian segmentation by 3.74 mIoU, two classes critical for safe navigation, all while leaving the deployed model's architecture and inference speed unchanged.
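To make the training scheme concrete, here is a minimal PyTorch sketch of how such a cycle-consistency objective could be wired up. Everything in it is illustrative: the function and module names (`cyclebev_losses`, the toy networks), the tuple-returning interfaces, and the choice of cross-entropy for the geometric term and an L2 feature term for the representation term are assumptions for exposition, not the authors' actual code or exact loss formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cyclebev_losses(vt_net, ivt_net, pv_images, bev_labels, pv_labels,
                    lambda_geo=1.0, lambda_rep=1.0):
    """Task loss plus two hypothetical cycle-consistency regularizers.

    vt_net:  PV images -> (BEV logits, intermediate PV feature map)
    ivt_net: BEV logits -> (reconstructed PV logits, reconstructed feature map)
    """
    # Primary task: supervised BEV semantic segmentation.
    bev_logits, pv_feats = vt_net(pv_images)
    task_loss = F.cross_entropy(bev_logits, bev_labels)

    # Close the loop: map the predicted BEV map back to perspective view.
    pv_recon_logits, recon_feats = ivt_net(bev_logits)

    # Geometric cycle term: the round-tripped prediction should agree
    # with perspective-view supervision.
    geo_loss = F.cross_entropy(pv_recon_logits, pv_labels)

    # Representation cycle term: sketched here as L2 agreement between the
    # forward pass's PV features and the features the IVT recovers.
    rep_loss = F.mse_loss(recon_feats, pv_feats.detach())

    return task_loss + lambda_geo * geo_loss + lambda_rep * rep_loss


if __name__ == "__main__":
    # Smoke test with toy stand-in networks (shapes only, no real geometry).
    class ToyNet(nn.Module):
        def __init__(self, in_ch, num_classes=4, width=8):
            super().__init__()
            self.feat = nn.Conv2d(in_ch, width, 3, padding=1)
            self.head = nn.Conv2d(width, num_classes, 1)

        def forward(self, x):
            f = torch.relu(self.feat(x))
            return self.head(f), f

    vt, ivt = ToyNet(in_ch=3), ToyNet(in_ch=4)
    pv = torch.randn(2, 3, 32, 32)            # perspective-view images
    bev_y = torch.randint(0, 4, (2, 32, 32))  # BEV segmentation labels
    pv_y = torch.randint(0, 4, (2, 32, 32))   # PV segmentation labels
    cyclebev_losses(vt, ivt, pv, bev_y, pv_y).backward()
```

Because the IVT head and both cycle terms appear only in the training objective, discarding them at export time leaves the deployed VT model, and hence its inference cost, exactly as before.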
- Proposes a cycle consistency framework (CycleBEV) that regularizes BEV segmentation models using an auxiliary Inverse View Transformation (IVT) network only during training.
- Achieves significant performance gains on nuScenes: up to 4.86 mIoU for vehicles and 3.74 mIoU for pedestrians without increasing model complexity at inference time.
- Demonstrates consistent improvements across four state-of-the-art View Transformation models spanning different paradigms, showing the method's general applicability.
Why It Matters
Enables more accurate and reliable environmental perception for self-driving cars, directly improving safety around dynamic objects such as vehicles and pedestrians.