Research & Papers

Frequency-Guided Fusion Boosts RGB-Thermal Segmentation to 86.24% mIoU on PST900

Dual ConvNeXt V2 backbones fuse thermal and RGB features with 3x fewer parameters than competing methods.

Deep Dive

İsmail Emre Canıtez and Özgür Erkent, researchers based in Turkey, have introduced a new multi-modal fusion architecture for RGB-thermal semantic segmentation, a key computer vision task for autonomous driving and robotics. Their approach, built on dual ConvNeXt V2 backbones, uses a stage-wise, modality-adaptive fusion strategy to address a long-standing challenge: how to integrate visible and thermal infrared images effectively at different levels of feature abstraction.

For early-stage features, the model uses a Frequency-Based Fusion Module that first decomposes infrared features into low- and high-frequency components via Gaussian filtering. A dual-branch spatial attention mechanism then selectively emphasizes thermal patterns in the low-frequency branch and fine-grained boundaries in the high-frequency branch. The attended thermal features are merged with the RGB features through a confidence-gated residual mechanism.

For late-stage features, a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions captures semantic correspondences across modalities. The fused features are decoded with a PANet-style bidirectional decoder trained with deep supervision.

On standard benchmarks, the lightest variant achieves 61.73% mean IoU (mIoU) on MFNet and 86.24% on PST900 with just 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. The code is publicly available.
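For readers unfamiliar with the metric behind these numbers, mean IoU averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below is a minimal NumPy illustration, not the authors' evaluation code. The function names `confusion_matrix` and `mean_iou` are hypothetical, and skipping classes that appear in neither prediction nor ground truth is an assumption, since this summary does not describe the benchmark protocol.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes):
    """Count pixels per (true class, predicted class) pair.

    Rows index the ground-truth class, columns the predicted class.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    idx = target * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Mean of per-class IoU = TP / (TP + FP + FN).

    Assumption: classes absent from both prediction and ground truth
    (denominator of zero) are left out of the average.
    """
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp  # predicted as the class, but wrong
    fn = conf.sum(axis=1) - tp  # belongs to the class, but missed
    denom = tp + fp + fn
    present = denom > 0
    return float((tp[present] / denom[present]).mean())


# Toy example with two classes and four pixels.
conf = confusion_matrix(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(round(mean_iou(conf), 4))  # class IoUs are 1/2 and 2/3, so this prints 0.5833
```

In practice the confusion matrix is accumulated over the whole test set before the per-class ratios are taken, so that large and small images contribute in proportion to their pixel counts.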

Key Points
  • Uses dual ConvNeXt V2 backbones with stage-wise fusion: frequency-based for early features, cross-modal attention for late features.
  • Achieves state-of-the-art results: 61.73% mIoU on MFNet and 86.24% on PST900 with only 35.43M parameters (lightest variant).
  • Introduces a Frequency-Based Fusion Module that decomposes thermal data into low/high frequencies via Gaussian filtering, with confidence-gated residual integration.
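
The frequency-based early fusion summarized above can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' module: the published model presumably learns its attention and gating weights, whereas the attention maps and the gate here are fixed, hand-written stand-ins. The function names (`gaussian_blur`, `decompose`, `frequency_fusion`), the `sigma` value, and the feature shapes are all hypothetical.

```python
import numpy as np


def gaussian_blur(feat, sigma=1.0):
    """Separable Gaussian blur over the spatial axes of a (C, H, W) array."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()  # normalize so constant inputs are preserved
    padded = np.pad(feat, [(0, 0), (radius, radius), (radius, radius)], mode="reflect")

    def conv(v):
        return np.convolve(v, kernel, mode="valid")

    rows = np.apply_along_axis(conv, 1, padded)  # filter along height
    return np.apply_along_axis(conv, 2, rows)    # filter along width


def decompose(thermal, sigma=1.0):
    """Split thermal features into low- and high-frequency parts."""
    low = gaussian_blur(thermal, sigma)
    high = thermal - low  # residual keeps edges and fine detail
    return low, high


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def frequency_fusion(rgb, thermal, sigma=1.0):
    """Fuse (C, H, W) RGB and thermal features; returns an array shaped like rgb."""
    low, high = decompose(thermal, sigma)
    # Stand-in spatial attention, one map per branch (assumption: not learned).
    att_low = sigmoid(low.mean(axis=0, keepdims=True))
    att_high = sigmoid(np.abs(high).mean(axis=0, keepdims=True))
    thermal_refined = att_low * low + att_high * high
    # Stand-in confidence gate: per-pixel agreement between the two modalities.
    gate = sigmoid((rgb * thermal_refined).mean(axis=0, keepdims=True))
    # Residual integration: RGB features pass through unchanged where the gate is small.
    return rgb + gate * thermal_refined


rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((4, 16, 16))
thermal_feat = rng.standard_normal((4, 16, 16))
fused = frequency_fusion(rgb_feat, thermal_feat)
print(fused.shape)  # (4, 16, 16)
```

The residual form is the useful design point: because the thermal contribution is added on top of the RGB features and scaled by a gate between 0 and 1, unreliable thermal responses can be suppressed without degrading the RGB stream.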

Why It Matters

Enables robust semantic segmentation in low-light conditions for autonomous vehicles and robotics while requiring less compute.