Wednesday 10 September 2025
A team of researchers has made a significant breakthrough in the field of multimodal image fusion, developing a novel approach that can effectively combine infrared and visible images for improved object detection and segmentation tasks.
The new method, dubbed MambaTrans, uses large language models to adapt multimodal fused images, whose pixel statistics differ markedly from natural images, to object detection and semantic segmentation models trained on visible images. The pre-trained models can then exploit the combined information without any adjustment to their parameters.
Pre-trained models are typically trained on large-scale RGB natural-image datasets such as COCO, so they can perform poorly when applied directly to multimodal fusion images. The large pixel-distribution gap between visible and fused images can degrade downstream task performance, sometimes to below that of using visible images alone.
MambaTrans addresses this issue with a novel modality translator for multimodal fusion images. The translator takes descriptions from a multimodal large language model and masks from a semantic segmentation model as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention with a 3D-Selective Scan Module to strengthen the model's purely visual representation.
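The article does not include reference code, but the mask-image-text cross-attention idea can be illustrated with a minimal sketch: image tokens act as queries and attend over the concatenated mask and text tokens. All names, shapes, and the identity query/key/value projections below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mask_image_text_attention(img_tokens, mask_tokens, text_tokens):
    """Toy cross-attention: image tokens (queries) attend over the
    concatenated mask and text tokens (keys/values).

    Shapes (illustrative): img (N_i, d), mask (N_m, d), text (N_t, d).
    Identity projections stand in for the learned Q/K/V weights.
    """
    d = img_tokens.shape[-1]
    context = np.concatenate([mask_tokens, text_tokens], axis=0)  # (N_m+N_t, d)
    attn = softmax(img_tokens @ context.T / np.sqrt(d))           # (N_i, N_m+N_t)
    return img_tokens + attn @ context                            # residual add

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 64))     # 16 image tokens, width 64
mask = rng.normal(size=(8, 64))     # 8 mask tokens
text = rng.normal(size=(4, 64))     # 4 text tokens
out = mask_image_text_attention(img, mask, text)
print(out.shape)  # (16, 64): image tokens enriched with mask/text context
```

The output keeps the image-token shape, so the block can slot into a purely visual pipeline while injecting mask and text context.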
By incorporating object detection prior knowledge, MambaTrans minimizes the detection loss during training and captures long-term dependencies among text, masks, and images, yielding favorable results from pre-trained models without adjusting their parameters.
Experiments on public datasets demonstrate that MambaTrans effectively improves multimodal image performance in downstream tasks such as object detection and semantic segmentation. The approach has significant implications for applications like night-time surveillance, autonomous driving, and security, where the fusion of infrared and visible images can provide valuable information about targets under low-light conditions.
The development of MambaTrans represents a major step forward in the field of multimodal image fusion, enabling the effective use of pre-trained models with multimodal fusion images. The approach has the potential to significantly improve the performance of various visual tasks, ultimately contributing to more accurate and efficient decision-making in a range of applications.
Cite this article: “Multimodal Image Fusion Breakthrough: MambaTrans Unlocks Pre-Trained Model Potential”, The Science Archive, 2025.
Multimodal Image Fusion, MambaTrans, Infrared Images, Visible Images, Object Detection, Semantic Segmentation, Large Language Models, Multimodal Fusion Images, Multimodal Translator, Pre-Trained Models.