Multimodal Emotion Recognition System Outperforms State-of-the-Art Models

Thursday 23 January 2025


Researchers have made significant progress in building machines that can recognize human emotions, but most of these systems rely on a single modality, such as facial expressions or speech. Humans, by contrast, communicate emotion multimodally, combining verbal and non-verbal cues to convey how they feel.


A team of scientists has now developed a system called MERITS-L (Multimodal Emotion Recognition In Conversations with Text-based Utterance Embeddings) that recognizes emotions in conversations by analyzing both the text and audio modalities. The system uses a novel hierarchical training approach: it starts with utterance-level modeling within a single modality, moves on to contextual modeling at the conversational level, and finishes by aligning the two modalities.
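
To make the staged pipeline concrete, here is a minimal sketch of such a three-stage hierarchy in PyTorch. It is an illustration under our own assumptions, not the authors' implementation: the module names, dimensions, and the cosine-based alignment loss are hypothetical stand-ins, and the three stages are combined into one loss purely to keep the example short, whereas a hierarchical approach would train them in sequence.

```python
# Toy sketch of a three-stage hierarchy (illustrative, not the paper's code):
# Stage 1: encode single-modality utterances into fixed vectors.
# Stage 2: contextualize utterance embeddings across the conversation.
# Stage 3: align text and audio embeddings of the same utterance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtteranceEncoder(nn.Module):
    """Stage 1: maps one utterance (a feature sequence) to a fixed vector."""
    def __init__(self, feat_dim=128, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        _, h = self.rnn(x)                 # h: (1, batch, emb_dim)
        return h.squeeze(0)                # (batch, emb_dim)

class ContextModel(nn.Module):
    """Stage 2: models conversational context over utterance embeddings."""
    def __init__(self, emb_dim=256, n_classes=7):
        super().__init__()
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * emb_dim, n_classes)

    def forward(self, utt_embs):           # utt_embs: (batch, n_utts, emb_dim)
        ctx, _ = self.rnn(utt_embs)
        return self.head(ctx)              # per-utterance emotion logits

def alignment_loss(text_emb, audio_emb):
    """Stage 3: pulls text/audio embeddings of the same utterance together."""
    return 1.0 - F.cosine_similarity(text_emb, audio_emb, dim=-1).mean()

# Toy pass: 2 conversations x 5 utterances, 40 frames of 128-dim features each.
text_enc, audio_enc, ctx_model = UtteranceEncoder(), UtteranceEncoder(), ContextModel()
feats = torch.randn(10, 40, 128)
t, a = text_enc(feats), audio_enc(feats)
logits = ctx_model(t.view(2, 5, -1))
labels = torch.randint(0, 7, (10,))
loss = F.cross_entropy(logits.view(-1, 7), labels) + alignment_loss(t, a)
loss.backward()
```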


The researchers tested MERITS-L on three datasets: IEMOCAP, MELD, and CMU-MOSI. The system achieves weighted F1-scores of 86.48%, 85.78%, and 81.45% on the three datasets respectively, outperforming state-of-the-art models on two of the three.
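
For readers unfamiliar with the metric: the weighted F1-score averages per-class F1 values, weighting each class by its number of true examples, which matters because emotion datasets are typically class-imbalanced. A quick toy illustration with scikit-learn (made-up labels, not the paper's data):

```python
from sklearn.metrics import f1_score

# Made-up labels purely to illustrate the metric.
y_true = ["happy", "sad", "sad", "neutral", "angry", "neutral"]
y_pred = ["happy", "sad", "neutral", "neutral", "angry", "sad"]

# Weighted F1: per-class F1 averaged with weights proportional to class support.
print(f1_score(y_true, y_pred, average="weighted"))
```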


One of the key innovations of MERITS-L is its use of pre-trained large language models to generate pseudo-emotion labels from speech transcripts. These labels are used to train a text-based emotion recognition model, which then serves as an utterance-level text embedding extractor. The system also uses a multimodal fusion strategy that combines the outputs of the text and audio branches through a co-attention mechanism.
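
Co-attention lets each modality decide which parts of the other stream to attend to. The sketch below shows one common way to realize this in PyTorch, with text queries attending over audio and vice versa; the dimensions, mean-pooling, and classifier head are our own illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Illustrative co-attention fusion of text and audio streams."""
    def __init__(self, dim=256, heads=4, n_classes=7):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, text, audio):        # both: (batch, seq_len, dim)
        # Text queries attend over audio keys/values, and vice versa.
        t_att, _ = self.text_to_audio(text, audio, audio)
        a_att, _ = self.audio_to_text(audio, text, text)
        # Pool each attended stream and classify the concatenation.
        fused = torch.cat([t_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)      # emotion logits

fusion = CoAttentionFusion()
logits = fusion(torch.randn(8, 20, 256), torch.randn(8, 30, 256))
print(logits.shape)                        # torch.Size([8, 7])
```

Because the two attention directions are computed independently, each modality can emphasize different cues in the other, which is the usual motivation for co-attention over simple concatenation.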


The researchers evaluated MERITS-L with different large language models (LLMs) as the pseudo-label generator and found that GPT-3.5 Turbo performed best, with a 52.98% overlap between the generated labels and the ground-truth emotions. Even with that imperfect supervision, the full system still surpasses prior state-of-the-art work on two of the three datasets.
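
The overlap figure is simply the fraction of utterances on which the LLM's pseudo-label matches the human annotation. A few lines of Python show the computation (toy labels, illustrative only):

```python
# Agreement rate between LLM pseudo-labels and ground truth (toy labels).
def label_overlap(pseudo, truth):
    matches = sum(p == t for p, t in zip(pseudo, truth))
    return 100.0 * matches / len(truth)

pseudo = ["happy", "sad", "angry", "neutral", "sad"]
truth  = ["happy", "sad", "neutral", "neutral", "angry"]
print(f"{label_overlap(pseudo, truth):.2f}%")   # 60.00%
```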


The development of MERITS-L has significant implications for human-computer interaction, customer service, and mental health monitoring. The system can be used to analyze emotions in conversations and provide personalized recommendations or support. Additionally, the hierarchical training approach used by MERITS-L could be applied to other multimodal tasks, such as gesture recognition or facial expression analysis.


Overall, MERITS-L is an important step towards building machines that can recognize human emotions in a more natural and intuitive way.


Cite this article: “Multimodal Emotion Recognition System Outperforms State-of-the-Art Models”, The Science Archive, 2025.


Multimodal Emotion Recognition, Conversational Analysis, Text-Based Utterance Embeddings, Hierarchical Training Approach, Pre-Trained Language Models, Co-Attention Mechanism, Human-Computer Interaction, Customer Service, Mental Health Monitoring, GPT


Reference: Soumya Dutta, Sriram Ganapathy, “LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations” (2025).