Frequency-Guided Fusion Boosts RGB-Thermal Segmentation with 86.24% mIoU

arXiv cs.CV May 27, 2026

⚡ Dual ConvNeXt V2 backbones fuse thermal and RGB features with 3x fewer parameters than recent methods.

Deep Dive

İsmail Emre Canıtez and Özgür Erkent, researchers based in Turkey, have proposed a new multi-modal fusion architecture for RGB-thermal semantic segmentation, a key computer vision task for autonomous driving and robotics. Their approach is built on dual ConvNeXt V2 backbones and uses a stage-wise, modality-adaptive fusion strategy to address a long-standing challenge: how to integrate visible and infrared images effectively at different levels of feature abstraction.

For early-stage features, the model uses a Frequency-Based Fusion Module that first decomposes infrared features into low- and high-frequency components via Gaussian filtering. A dual-branch spatial attention mechanism then selectively emphasizes thermal patterns and fine-grained boundaries. The attended thermal features are merged with the RGB features through a confidence-gated residual mechanism.

For late-stage features, a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions captures semantic correspondences across modalities. The fused features are decoded with a PANet-style bidirectional decoder trained with deep supervision.

On standard benchmarks, the lightest variant achieves 61.73% mean IoU on MFNet and 86.24% on PST900 with just 35.43M parameters, outperforming recent methods while requiring substantially fewer parameters and less computation. The code is publicly available.
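To make the early-stage idea concrete, here is a minimal pure-Python sketch of a Gaussian low/high frequency split followed by a gated residual merge. It is an illustration under stated assumptions, not the authors' implementation: the function names (`split_frequencies`, `gated_residual_fusion`), the kernel radius and sigma, and the use of a plain sigmoid over externally supplied gate logits are all hypothetical. In the actual model the gate and the dual-branch attention would be learned and applied to multi-channel feature maps.

```python
import math


def gaussian_kernel(radius, sigma):
    """Return a normalized 1D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = min(max(x + k - radius, 0), width - 1)
                acc += w * row[src]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, radius=1, sigma=1.0):
    """Separable 2D Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(radius, sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


def split_frequencies(thermal, radius=1, sigma=1.0):
    """Low band = blurred map; high band = what the blur removed."""
    low = gaussian_blur(thermal, radius, sigma)
    high = [[t - l for t, l in zip(t_row, l_row)]
            for t_row, l_row in zip(thermal, low)]
    return low, high


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_fusion(rgb, thermal, gate_logits):
    """Assumed form of the merge: rgb + sigmoid(gate) * thermal, per element."""
    return [[r + sigmoid(g) * t for r, t, g in zip(r_row, t_row, g_row)]
            for r_row, t_row, g_row in zip(rgb, thermal, gate_logits)]


# Toy single-channel "thermal" map with a vertical edge in the middle.
thermal_map = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
low_band, high_band = split_frequencies(thermal_map)
print("low :", [round(v, 3) for v in low_band[0]])
print("high:", [round(v, 3) for v in high_band[0]])
```

In this toy setup the high band is nonzero only around the edge, which matches the intuition that high frequencies carry boundaries while the low band carries smooth thermal patterns. A strongly negative gate logit leaves the RGB value essentially unchanged, so the residual form lets the network fall back to RGB where the thermal signal is judged unreliable.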

Key Points

Uses dual ConvNeXt V2 backbones with stage-wise fusion: frequency-based for early features, cross-modal attention for late features.
Achieves state-of-the-art results: 61.73% mIoU on MFNet and 86.24% on PST900 with only 35.43M parameters (lightest variant).
Introduces a Frequency-Based Fusion Module that decomposes thermal data into low/high frequencies via Gaussian filtering, with confidence-gated residual integration.
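
For readers unfamiliar with the metric behind these numbers, the sketch below shows how mean IoU (mIoU) can be computed from predicted and ground-truth labels. It is a simplified illustration: the function name `mean_iou` is hypothetical, labels are given as flat lists, and classes that appear in neither the prediction nor the ground truth are skipped, which is an assumption. Benchmark protocols such as those used for MFNet and PST900 typically accumulate counts over the whole test set and may treat absent classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label lists."""
    intersections = [0] * num_classes
    unions = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # A correct pixel counts once in both intersection and union.
            intersections[p] += 1
            unions[p] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            unions[p] += 1
            unions[t] += 1
    # Assumption: ignore classes absent from both prediction and target.
    ious = [i / u for i, u in zip(intersections, unions) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


predicted = [0, 0, 1, 1]
ground_truth = [0, 1, 1, 1]
print(mean_iou(predicted, ground_truth, num_classes=2))
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Because every class contributes equally regardless of its pixel count, mIoU penalizes models that do well on large background regions but miss small objects.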

Why It Matters

Enables robust semantic segmentation in low-light conditions for autonomous vehicles and robotics, using fewer compute resources.
