Multimodal Fusion with Contrastive Consistency: A Semi-Supervised Approach for Improved Accuracy in Medical Imaging Classification

Tuesday 08 April 2025


The quest for better image classification has led researchers to combine images with tabular data, but this approach brings its own challenges. One major issue is that traditional methods focus on the features shared between modalities, leaving task-relevant information that is specific to a single modality unexplored.


A new study proposes a solution to this problem by introducing STiL, a semi-supervised learning framework that comprehensively explores task-relevant information in both images and tables. The framework consists of two main components: a disentangled contrastive consistency module and a consensus-guided pseudo-labeling strategy.


The disentangled contrastive consistency module is designed to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. This allows the model to capture unique characteristics of each modality, rather than simply relying on shared features.
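
As a rough illustration of the idea, the sketch below pairs a standard InfoNCE contrastive loss on the shared parts of the image and table embeddings with a simple penalty that discourages the shared and modality-specific parts from overlapping. This is a generic stand-in, not the loss used in STiL: the function names, the temperature value, the squared-cosine penalty, and the assumption that each embedding is already split into equally sized shared and specific vectors are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def info_nce(
    image_shared: List[Vector],
    table_shared: List[Vector],
    temperature: float = 0.1,  # assumed value, chosen only for illustration
) -> float:
    """Image-to-table InfoNCE loss; index i in both lists is the same sample."""
    total = 0.0
    for i, img in enumerate(image_shared):
        logits = [cosine(img, tab) / temperature for tab in table_shared]
        # Numerically stable log-sum-exp over all candidate tables.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        # The matching table (index i) is the positive pair.
        total += log_denominator - logits[i]
    return total / len(image_shared)


def overlap_penalty(shared: List[Vector], specific: List[Vector]) -> float:
    """Mean squared cosine between shared and specific parts.

    Assumed disentanglement term: it is 0 when the two parts are orthogonal
    and 1 when they point in the same direction.
    """
    return sum(cosine(s, p) ** 2 for s, p in zip(shared, specific)) / len(shared)
```

In this toy setup, the contrastive term is small when each image's shared embedding matches its own table's shared embedding and large when the pairs are mismatched, while the penalty pushes modality-specific features away from the shared ones so they can carry different information.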


Meanwhile, the consensus-guided pseudo-labeling strategy generates pseudo-labels for unlabeled samples by aggregating predictions from multiple classifiers and leveraging their consensus. This approach helps reduce the risk of classifier collusion, in which classifiers reinforce one another's mistakes, and makes the resulting pseudo-labels more reliable.
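
A minimal sketch of what consensus-based pseudo-labeling can look like is shown below. The acceptance rule here (every classifier must predict the same class, and the averaged probability for that class must reach a threshold) and the threshold value are assumptions made for illustration; the article does not describe STiL's exact rule.

```python
from typing import List, Optional, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def consensus_pseudo_label(
    classifier_probs: List[Sequence[float]],
    threshold: float = 0.8,  # assumed confidence cut-off
) -> Optional[int]:
    """Return a pseudo-label for one unlabeled sample, or None to skip it.

    classifier_probs holds one class-probability vector per classifier.
    """
    votes = {argmax(p) for p in classifier_probs}
    if len(votes) != 1:
        return None  # classifiers disagree, so no pseudo-label is assigned

    # Average the class probabilities across classifiers.
    num_classes = len(classifier_probs[0])
    mean_probs = [
        sum(p[c] for p in classifier_probs) / len(classifier_probs)
        for c in range(num_classes)
    ]

    label = votes.pop()
    return label if mean_probs[label] >= threshold else None
```

For example, three classifiers that all favor class 0 with high confidence would produce the pseudo-label 0, while a sample on which they disagree, or agree only weakly, would be left unlabeled for that training round.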


The researchers tested STiL on several datasets, including natural and medical images, and compared its performance with state-of-the-art semi-supervised learning methods. The results showed that STiL outperformed other methods in all cases, demonstrating its effectiveness in exploring task-relevant information from both images and tables.


One of the key advantages of STiL is that it holds up across different proportions of labeled data. In an experiment where only 1% of the data was labeled, STiL still achieved strong results, whereas other methods struggled. This suggests that STiL can be applied to real-world scenarios where labeled data is scarce.


Another benefit of STiL is its robustness across different tabular encoders. The researchers tested the framework with two different pre-trained tabular encoders and found that STiL remained stable and effective, even when using a less powerful encoder.


In addition to its technical achievements, STiL has practical implications for real-world applications. For instance, in medical diagnosis, STiL could be used to analyze images and tables from patients’ records to improve diagnostic accuracy.


The study’s findings suggest that by combining semi-supervised learning with disentangled contrastive consistency and consensus-guided pseudo-labeling, researchers can develop more effective frameworks for exploring task-relevant information in multimodal data.


Cite this article: “Multimodal Fusion with Contrastive Consistency: A Semi-Supervised Approach for Improved Accuracy in Medical Imaging Classification”, The Science Archive, 2025.


Image Classification, Semi-Supervised Learning, STiL, Disentangled Contrastive Consistency, Consensus-Guided Pseudo-Labeling, Multimodal Data, Natural Images, Medical Images, Tabular Encoders, Diagnostic Accuracy


Reference: Siyi Du, Xinzhe Luo, Declan P. O’Regan, Chen Qin, “STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification” (2025).

