Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis

Wenmeng Yu, Hua Xu, Ziqi Yuan, Jiele Wu

Representation learning is a significant and challenging task in multimodal learning. Effective modality representations should contain two parts of characteristics: the consistency and the difference. Due to the unified multimodal annotation, existing methods are restricted in capturing differentiated information. However, additional unimodal annotations incur high time and labor costs. In this paper, we design a label generation module based on the self-supervised learning strategy to…


The Weighted Cross-Modal Attention Mechanism With Sentiment Prediction Auxiliary Task for Multimodal Sentiment Analysis

This paper designs a weighted cross-modal attention mechanism which not only captures the temporal correlation information and the spatial dependence information of each modality, but also dynamically adjusts the weight of each modality across different time steps.

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

A novel framework named MultiModal Contrastive Learning (MMCL) is proposed for multimodal representation to capture intra- and inter-modality dynamics simultaneously; it designs two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote prediction and to learn more interactive information related to sentiment.
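An instance-based contrastive objective of this kind is commonly implemented as an InfoNCE loss over paired modality embeddings. The sketch below is a minimal NumPy illustration, not MMCL's actual loss: the function name, the temperature value, and the use of raw (unprojected) embeddings are all assumptions.

```python
import numpy as np

def instance_contrastive_loss(text_emb, audio_emb, temperature=0.1):
    """InfoNCE-style instance contrastive loss. Row i of each matrix
    embeds the same utterance, so diagonal pairs are positives and all
    other rows in the batch act as negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = (t @ a.T) / temperature                  # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # -log p(positive | row)

rng = np.random.default_rng(0)
loss = instance_contrastive_loss(rng.normal(size=(8, 16)),
                                 rng.normal(size=(8, 16)))
```

Minimizing this loss pulls the two views of the same utterance together while pushing apart mismatched utterances within the batch; perfectly aligned embeddings drive the loss toward zero.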

Sense-aware BERT and Multi-task Fine-tuning for Multimodal Sentiment Analysis

This paper proposes Sense-aware BERT (SenBERT), which allows sense information to be integrated with BERT during fine-tuning and exploits multimodal multi-head attention to capture the interaction between unaligned multimodal data.

Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning

This paper proposes unified pre-training for multi-modal prompt-based fine-tuning (UP-MPF) with two stages, and employs a simple and effective task to obtain coherent vision-language representations from fixed pre-trained language models (PLMs).

Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

EMT-DLFR employs utterance-level representations from each modality as the global multimodal context to interact with local unimodal features, letting the two mutually promote each other, and innovatively regards complete and incomplete data as two different views of one sample.

Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis

Adapted Multimodal BERT (AMB) is proposed, a BERT-based architecture for multimodal tasks that uses a combination of adapter modules and intermediate fusion layers, leading to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.

Cross-Modality Gated Attention Fusion for Multimodal Sentiment Analysis

CMGA, a Cross-Modality Gated Attention fusion model for MSA, is proposed to make adequate interactions across different modality pairs; it adds a forget gate to remove the noisy and redundant signals introduced in the interaction procedure.

On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

Experiments reveal that methods with domain-specific pre-trained encoders attain better performance than those with conventional features in both unimodal and multimodal scenarios.

CMJRT: Cross-Modal Joint Representation Transformer for Multimodal Sentiment Analysis

A novel multimodal sentiment analysis framework named Cross-Modal Joint Representation Transformer (CMJRT) is proposed; it exploits hierarchical interactions among modalities by passing joint representations from bimodality to unimodality, and outperforms existing approaches.

Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis

This work proposes a framework named MultiModal InfoMax (MMIM), which hierarchically maximizes the Mutual Information (MI) in unimodal input pairs (inter-modality) and between the multimodal fusion result and unimodal input, in order to maintain task-related information through multimodal fusion.

CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality

This paper introduces a Chinese single- and multi-modal sentiment analysis dataset, CH-SIMS, which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations, and proposes a multi-task learning framework based on late fusion as the baseline.

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

A novel framework, MISA, is proposed; it projects each modality into two distinct subspaces that provide a holistic view of the multimodal data, which is then used for fusion and task predictions.

Learning Factorized Multimodal Representations

A model is introduced that factorizes representations into two sets of independent factors, multimodal discriminative factors and modality-specific generative factors; it demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance.

Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis

A deep multi-task learning framework is presented that jointly performs sentiment and emotion analysis, along with a context-level inter-modal attention framework for simultaneously predicting the sentiment and expressed emotions of an utterance.

Efficient Low-rank Multimodal Fusion With Modality-Specific Factors

The Low-rank Multimodal Fusion method is proposed, which performs multimodal fusion using low-rank tensors to improve efficiency and is indeed much more efficient in both training and inference compared to other methods that utilize tensor representations.
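The core idea can be illustrated with a small sketch: decompose the fusion weight tensor into per-modality low-rank factors so the full outer-product tensor is never materialized. The weights below are hypothetical random stand-ins for learned parameters, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_t, d_a, d_v, d_out, rank = 4, 3, 5, 6, 2

# Hypothetical random weights standing in for learned modality-specific
# low-rank factors (one factor set per modality, per rank component).
W_t = rng.normal(size=(rank, d_t + 1, d_out))
W_a = rng.normal(size=(rank, d_a + 1, d_out))
W_v = rng.normal(size=(rank, d_v + 1, d_out))

def low_rank_fusion(z_t, z_a, z_v):
    """Fuse three unimodal vectors without building the full
    (d_t+1) x (d_a+1) x (d_v+1) outer-product tensor: project each
    modality with its factors, multiply elementwise, sum over the rank."""
    zt = np.concatenate([z_t, [1.0]])   # appended 1 keeps lower-order terms
    za = np.concatenate([z_a, [1.0]])
    zv = np.concatenate([z_v, [1.0]])
    p_t = np.einsum('i,rio->ro', zt, W_t)
    p_a = np.einsum('j,rjo->ro', za, W_a)
    p_v = np.einsum('k,rko->ro', zv, W_v)
    return (p_t * p_a * p_v).sum(axis=0)  # (d_out,)

h = low_rank_fusion(rng.normal(size=d_t),
                    rng.normal(size=d_a),
                    rng.normal(size=d_v))
```

Expanding the sum over rank components shows this equals contracting the unimodal vectors with a rank-constrained full weight tensor, which is where the efficiency gain over explicit tensor fusion comes from.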

Integrating Multimodal Information in Large Pretrained Transformers

Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet.

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance compared to the previous state of the art.

Tensor Fusion Network for Multimodal Sentiment Analysis

A novel model, termed Tensor Fusion Network, is introduced, which learns intra-modality and inter-modality dynamics end-to-end in sentiment analysis and outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
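The characteristic operation here is the outer product of the unimodal embeddings, each padded with a constant 1 so unimodal and bimodal interaction terms survive in the fused tensor alongside the trimodal ones. A minimal NumPy sketch with made-up dimensions:

```python
import numpy as np

def tensor_fusion(z_t, z_a, z_v):
    """TFN-style fusion sketch: 3-way outer product of unimodal
    embeddings, each extended with a constant 1 so the result contains
    unimodal, bimodal, and trimodal interaction sub-tensors."""
    zt = np.concatenate([z_t, [1.0]])
    za = np.concatenate([z_a, [1.0]])
    zv = np.concatenate([z_v, [1.0]])
    return np.einsum('i,j,k->ijk', zt, za, zv).ravel()

# Dimensions grow multiplicatively: (3+1) * (2+1) * (4+1) = 60 entries,
# which is the cost that low-rank fusion methods later address.
fused = tensor_fusion(np.ones(3), np.ones(2), np.ones(4))
```

With all-ones inputs the fused vector is all ones, which makes the multiplicative size growth easy to see in isolation.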

Multimodal Transformer for Unaligned Multimodal Language Sequences

Comprehensive experiments on both aligned and non-aligned multimodal time series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that correlated crossmodal signals can be captured by the proposed crossmodal attention mechanism in MulT.
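The crossmodal attention at the heart of MulT lets one modality's sequence attend to another's, with no requirement that the two sequences be aligned or equal in length. A bare-bones single-head sketch (learned query/key/value projections omitted; both modalities are assumed to share a feature dimension):

```python
import numpy as np

def cross_modal_attention(target_seq, source_seq):
    """Single-head crossmodal attention sketch: the target modality
    provides queries, the source modality provides keys and values,
    so the source sequence reinforces the target's features."""
    d = target_seq.shape[-1]
    scores = target_seq @ source_seq.T / np.sqrt(d)   # (L_tgt, L_src)
    scores -= scores.max(axis=-1, keepdims=True)      # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                # rows sum to 1
    return w @ source_seq                             # (L_tgt, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(7, 8))    # 7 text steps, feature dim 8
audio = rng.normal(size=(12, 8))  # 12 audio steps: unaligned lengths are fine
out = cross_modal_attention(text, audio)
```

The output keeps the target sequence's length while mixing in source-modality content, which is what allows MulT to operate directly on unaligned sequences.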

Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

A novel model, the Interaction Canonical Correlation Network (ICCN), is proposed, which learns correlations among all three modalities via deep canonical correlation analysis (DCCA); the proposed embeddings are tested on several benchmark datasets against other state-of-the-art multimodal embedding algorithms.