Advances in Audio-Visual Segmentation Using Pre-Trained Text-Prompted Models

Friday 28 March 2025


A team of researchers has reported an advance in audio-visual segmentation, the task of locating and segmenting the objects in a video that produce the sounds heard in its audio track. The new approach builds on a pre-trained, text-prompted segmentation model, the Segment Anything Model (SAM), to learn the relationship between audio and visual features.


Audio-visual segmentation has traditionally been difficult for two reasons: pixel-level labeled data is scarce, and the task itself is complex, since a model must decide which visible objects correspond to which sounds. Deep learning models have allowed researchers to develop more accurate approaches to the problem.


The new approach uses a multimodal encoder trained on large text-image paired datasets. This allows the model to learn the relationship between text and visual features, which can then be used to identify and segment sounding objects in video recordings.


One of the key innovations of this approach is a fused feature, f_CLIP ⊙ f_CLAP, where ⊙ denotes an element-wise product of a visual feature and an audio feature. The fused feature captures the intersection of the audio and visual modalities. It is obtained by projecting both audio and visual embeddings into a shared semantic space using learned projection matrices.
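The fusion step described above can be illustrated with a minimal sketch. It assumes that ⊙ is an element-wise product and that each modality has its own projection matrix; the function names, the toy dimensions, and the random stand-in weights are all hypothetical and are not taken from the paper.

```python
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]

def fuse(image_emb, audio_emb, w_image, w_audio):
    """Project each embedding into the shared space, then multiply element-wise."""
    f_clip = matvec(w_image, image_emb)  # projected visual feature
    f_clap = matvec(w_audio, audio_emb)  # projected audio feature
    return hadamard(f_clip, f_clap)

# Toy sizes (assumptions): raw embeddings differ in length, the shared space has 4 dims.
rng = random.Random(0)
shared_dim, image_dim, audio_dim = 4, 6, 5

# Random values stand in for projection matrices that would be learned in training.
w_image = [[rng.gauss(0, 1) for _ in range(image_dim)] for _ in range(shared_dim)]
w_audio = [[rng.gauss(0, 1) for _ in range(audio_dim)] for _ in range(shared_dim)]
image_emb = [rng.gauss(0, 1) for _ in range(image_dim)]
audio_emb = [rng.gauss(0, 1) for _ in range(audio_dim)]

fused = fuse(image_emb, audio_emb, w_image, w_audio)
print(len(fused))  # one value per shared-space dimension
```

Because the product is taken dimension by dimension, a component of the fused vector is large only when both modalities have a large value in that dimension, which is one way to read "intersection" here.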


The model is then fine-tuned for audio-visual segmentation on a dataset of labeled video recordings. The reported results show the new approach outperforming previous state-of-the-art methods on both evaluation datasets, with an improvement of 1.56 in M_J (the mean Jaccard index, a region-overlap score). It achieves this without introducing extra adapter layers into every image encoder block.
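Reading the metric above as the mean Jaccard index (intersection over union, averaged over frames), the computation can be sketched as follows. The mask layout as nested lists of 0/1 values and the convention of scoring two empty masks as 1.0 are assumptions made for this example, not details from the paper.

```python
def jaccard(pred, target):
    """Intersection over union of two same-shape binary masks (lists of rows)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0

def mean_jaccard(preds, targets):
    """Average the per-frame Jaccard index over a sequence of frames."""
    scores = [jaccard(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)

# Two toy 2x3 masks: 2 pixels overlap, 4 pixels are set in at least one mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

print(jaccard(pred, target))                             # 0.5
print(mean_jaccard([pred, target], [target, target]))    # 0.75
```

If scores are reported on a 0 to 100 scale, an improvement of 1.56 would correspond to 0.0156 in the raw ratio computed here.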


Furthermore, the model can segment sounding objects in videos even when multiple sound sources are present. This is a significant advance over previous approaches that could handle only single-sound-source scenarios.


The implications of this breakthrough are far-reaching, with potential applications in fields such as video surveillance, audio-visual indexing, and human-computer interaction. For example, the new approach could be used to automatically segment and identify sounds in videos, allowing for more efficient searching and retrieval of relevant information.


In addition, the model’s ability to handle multiple sound sources could be used to improve audio-visual scene understanding, enabling more accurate recognition of complex scenes and objects.


Overall, this breakthrough has the potential to revolutionize the field of audio-visual segmentation, enabling more accurate and efficient identification and segmentation of sounds and visual objects in video recordings. The use of pre-trained text-prompted models and multimodal encoders has opened up new possibilities for researchers and has paved the way for further advancements in this area.


Cite this article: “Advances in Audio-Visual Segmentation Using Pre-Trained Text-Prompted Models”, The Science Archive, 2025.


Audio-Visual Segmentation, Deep Learning, Multimodal Encoder, Text-Prompted Model, SAM, fCLIP, fCLAP, Video Recordings, Audio-Visual Features, Object Recognition


Reference: Kyungbok Lee, You Zhang, Zhiyao Duan, “Audio Visual Segmentation Through Text Embeddings” (2025).

