Limitations of Contrastive Language-Image Pre-training Models in Multi-Object Scenarios

Sunday 30 March 2025


The latest advancements in artificial intelligence have led to significant breakthroughs in various fields, including computer vision and natural language processing. A recent study has shed light on the limitations of Contrastive Language-Image Pre-training (CLIP) models, a popular approach used to integrate visual and linguistic data.


CLIP models are designed to learn from large datasets of images and their corresponding captions. This allows them to develop an understanding of object relationships and contexts, enabling applications such as image classification, object detection, and image captioning. However, researchers have found that CLIP models struggle with complex multi-object scenarios, where multiple objects are present in the same image.


To address this issue, a team of scientists conducted a comprehensive analysis of CLIP’s performance limitations in multi-object contexts. They created two custom datasets, SimCO and ComCO, which consist of images with varying numbers of objects (2-5) and their corresponding captions. The researchers then evaluated the performance of different CLIP models on these datasets.


The results showed that CLIP models tend to favor larger objects in images, prioritizing them in captions even when smaller objects are mentioned first. This bias is evident in both image-text matching and text-based object retrieval tasks. For instance, when given an image with a zebra and a truck, the model is more likely to recognize the truck as the primary object, even if the caption mentions the zebra first.


The study also found that reordering objects in captions can significantly impact CLIP’s performance. When the order of objects in a caption is altered, the model’s accuracy drops dramatically. This highlights the importance of considering object relationships and contexts when designing CLIP models.


To further understand these biases, the researchers analyzed the COCO dataset, a large collection of images with annotated objects. They discovered that larger objects are more frequently mentioned first in captions, which reinforces their findings on SimCO and ComCO.


The implications of this study are significant, as they highlight the need for more nuanced approaches to integrating visual and linguistic data. By acknowledging the limitations of CLIP models, researchers can develop more robust and accurate systems that better understand object relationships and contexts.


Moreover, this study serves as a reminder of the importance of testing AI models on diverse datasets and evaluating their performance in various scenarios. As AI continues to advance, it is crucial to identify and address its biases and limitations to ensure its applications are fair, reliable, and effective.


Cite this article: “Limitations of Contrastive Language-Image Pre-training Models in Multi-Object Scenarios”, The Science Archive, 2025.


Artificial Intelligence, Contrastive Language-Image Pre-Training, Computer Vision, Natural Language Processing, Object Relationships, Image Captioning, Object Detection, Image Classification, Bias, Limitations


Reference: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah, “Analyzing CLIP’s Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study” (2025).


Leave a Reply