Multimodal End-to-End Sparse Model for Emotion Recognition

Wenliang Dai, Samuel Cahyawijaya, Zihan Liu, Pascale Fung
Existing works in multimodal affective computing tasks, such as emotion recognition and personality recognition, generally adopt a two-phase pipeline: they first extract feature representations for each single modality with hand-crafted algorithms, and then perform end-to-end learning with the extracted features. However, the extracted features are fixed and cannot be further fine-tuned on different target tasks, and manually finding feature extraction algorithms does not generalize or scale…

Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention

This work trains the model end-to-end, which allows the early layers of the network to adapt while taking into account the later fusion layers of the two modalities. All layers of the model were fine-tuned for the downstream task of emotion recognition, so there was no need to train the networks from scratch.

Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

This work considers discrete emotions and uses text, audio, and vision as modalities. Its end-to-end feature learning approach is the first attempt in the MER literature based on a contrastive loss between pairwise modalities.

Weakly-supervised Multi-task Learning for Multimodal Affect Recognition

This paper explores three multimodal affect recognition tasks: 1) emotion recognition; 2) sentiment analysis; and 3) sarcasm recognition. It suggests that weak supervision can provide a comparable contribution to strong supervision if the tasks are highly correlated.

Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding

Behaviour encoding is proposed, which boosts performance with minimal change to the model, and the entire input is processed together to avoid using a large model for video processing.

A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition

A novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) is proposed for multimodal emotion recognition; it achieves state-of-the-art results, obtaining 75.76% accuracy with 26.30M parameters.

Multilevel Transformer For Multimodal Emotion Recognition

A novel multi-granularity framework is introduced, which combines fine-grained representation with pre-trained utterance-level representation and outperforms previous state-of-the-art approaches on the IEMOCAP dataset with text transcripts and speech signal.

Multimodal interaction enhanced representation learning for video emotion recognition

A semantic enhancement module is first designed to guide the audio/visual encoder using the semantic information from text, then the multimodal bottleneck Transformer is adopted to further reinforce the audio and visual representations by modeling the cross-modal dynamic interactions between the two feature sequences.

Leveraging Multi-modal Interactions among the Intermediate Representations of Deep Transformers for Emotion Recognition

The RILA model achieves state-of-the-art performance by fully exploiting the multi-modal interactions among the intermediate representations of deep pre-trained transformers for end-to-end emotion recognition.

Survey on Emotion Recognition Databases

This paper focuses on identifying the availability of datasets relevant to the elderly and children, the primary target groups of social robots, and on reviewing the data collection and annotation processes.

Do Multimodal Emotion Recognition Models Tackle Ambiguity?

It is concluded that although databases provide annotations with ambiguity, most of these models do not fully exploit them, showing that there is still room for improvement in multimodal emotion recognition systems.