An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition

Yang Wu, Pai Peng, Zhenyu Zhang, Yanyan Zhao, Bing Qin
Recent work on multi-modal emotion recognition is moving towards end-to-end models, which, unlike the two-phase pipeline, can extract task-specific features supervised by the target task. However, previous methods only model the feature interactions between the textual modality and either the acoustic or the visual modality, and ignore the feature interactions between the acoustic and visual modalities. In this paper, we propose the multi-modal end-to-end transformer (ME2ET), which can…

Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition

A modality-transferable model is proposed that can adapt directly to unseen emotions in any modality, since it has their pre-trained embeddings and modality mapping functions; it outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.

Integrating Multimodal Information in Large Pretrained Transformers

Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet.

Progressive Modality Reinforcement for Human Multimodal Emotion Recognition from Unaligned Multimodal Sequences

This work proposes the Progressive Modality Reinforcement (PMR) approach, based on recent advances in the crossmodal transformer, which introduces a message hub to exchange information with each modality and reinforces their features via crossmodal attention.

Multimodal Transformer for Unaligned Multimodal Language Sequences

Comprehensive experiments on both aligned and non-aligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that correlated crossmodal signals can be captured by the proposed crossmodal attention mechanism in MulT.

AST: Audio Spectrogram Transformer

The Audio Spectrogram Transformer (AST) is introduced, the first convolution-free, purely attention-based model for audio classification, and an approach to transfer knowledge from ImageNet pretrained ViT to AST is proposed.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Efficient Low-rank Multimodal Fusion With Modality-Specific Factors

The Low-rank Multimodal Fusion method is proposed, which performs multimodal fusion using low-rank tensors to improve efficiency and is indeed much more efficient in both training and inference compared to other methods that utilize tensor representations.

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance when compared to the previous state of the art.

Tensor Fusion Network for Multimodal Sentiment Analysis

A novel model, termed Tensor Fusion Networks, is introduced, which learns intra-modality and inter-modality dynamics end-to-end in sentiment analysis and outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

A new Tokens-To-Token Vision Transformer (T2T-ViT) is proposed, which incorporates an efficient backbone with a deep-narrow structure for vision transformers, motivated by CNN architecture design after empirical study, and reduces the parameter count and MACs of vanilla ViT by half.
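
Several of the entries above (ME2ET, PMR, MulT) rely on attention computed across modalities. As a rough illustration of that general idea, the sketch below computes scaled dot-product attention in which the queries come from one modality and the keys and values come from another. It is a simplified sketch and not the implementation from any of the listed papers: the function names and toy inputs are hypothetical, and the learned query/key/value projections and multiple heads used in practice are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source):
    """Attend from `target` (queries) to `source` (keys and values).

    Both arguments are lists of feature vectors of the same dimension d.
    The two sequences may have different lengths, so no alignment between
    modalities is required. Learned projections are omitted (assumption:
    the raw features are used directly as queries, keys and values).
    """
    d = len(target[0])
    scale = math.sqrt(d)
    output = []
    for q in target:
        # Similarity of this target step to every source step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in source]
        weights = softmax(scores)
        # Weighted average of the source vectors.
        attended = [sum(w * v[j] for w, v in zip(weights, source)) for j in range(d)]
        output.append(attended)
    return output


# Toy example: 2 "text" steps attend over 3 "audio" steps (made-up features).
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = crossmodal_attention(text, audio)
```

The output has one vector per target step, each a convex combination of the source vectors, so the target sequence keeps its own length while absorbing information from the other modality.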