Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

@inproceedings{zadeh-etal-2018-multimodal,
  title={Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph},
  author={Amir Zadeh and Paul Pu Liang and Soujanya Poria and Erik Cambria and Louis-Philippe Morency},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2018}
}
Analyzing human multimodal language is an emerging area of research in NLP. […] Unlike previously proposed fusion techniques, DFG is highly interpretable and achieves competitive performance compared to the current state of the art.


IMCN: Identifying Modal Contribution Network for Multimodal Sentiment Analysis

A highly generalized identifying modal contribution network (IMCN) is proposed, which contains modality interaction, modality fusion, and modality joint learning units in its framework, and is compared with other popular multimodal sentiment analysis models.

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

The network model encodes multimodal data through a Bidirectional Encoder Representations from Transformers (BERT) network and a Transformer encoder to resolve long-term dependencies within modalities, and reconstructs the Transformer decoder to solve the weighting problem of multimodal data in an iterative way.

MOSEAS: A Multimodal Language Dataset for Spanish, Portuguese, German and French

The first large-scale multimodal language dataset for Spanish, Portuguese, German and French, called CMU-MOSEAS (CMU Multimodal Opinion Sentiment, Emotions and Attributes), is introduced, which is the largest of its kind with 40,000 total labelled sentences.

Improving Multimodal fusion via Mutual Dependency Maximisation

This work investigates unexplored penalties and proposes a set of new objectives that measure the dependency between modalities, and demonstrates that the new penalties lead to a consistent improvement across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: CMU-MOSI and CMU-MOSEI.

Feature Extraction Network with Attention Mechanism for Data Enhancement and Recombination Fusion for Multimodal Sentiment Analysis

A new method and model for processing multimodal signals is proposed that takes into account the delay and hysteresis characteristics of multimodal signals across the time dimension.

SWAFN: Sentimental Words Aware Fusion Network for Multimodal Sentiment Analysis

The experimental results show that introducing sentimental-word prediction as an auxiliary task can effectively improve the fused representation of multiple modalities.

Unsupervised Multimodal Language Representations using Convolutional Autoencoders

Extensive experimentation on Sentiment Analysis and Emotion Recognition indicates that the learned representations can achieve near-state-of-the-art performance with just the use of a Logistic Regression algorithm for downstream classification.

Dynamic Invariant-Specific Representation Fusion Network for Multimodal Sentiment Analysis

A new framework, namely, dynamic invariant-specific representation fusion network (DISRFN), is put forward in this study, and the experimental results verify the effectiveness of the DISRFN framework and loss function.

CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality

This paper introduces a Chinese single- and multi-modal sentiment analysis dataset, CH-SIMS, which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations, and proposes a multi-task learning framework based on late fusion as the baseline.
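Late fusion, as used for the CH-SIMS baseline above, combines independent unimodal predictions at the decision level. A minimal sketch of the general idea, not of CH-SIMS itself (the class layout, weights, and probability values are illustrative assumptions):

```python
import numpy as np

def late_fusion(unimodal_probs, weights=None):
    """Combine per-modality class probabilities by weighted averaging.

    unimodal_probs: list of (n_classes,) probability vectors, one per modality.
    """
    probs = np.stack(unimodal_probs)  # (n_modalities, n_classes)
    if weights is None:
        # Default to an unweighted average of the modalities.
        weights = np.full(len(unimodal_probs), 1.0 / len(unimodal_probs))
    fused = weights @ probs           # weighted average over modalities
    return int(np.argmax(fused)), fused

# Hypothetical per-modality sentiment distributions (negative, neutral, positive).
text  = np.array([0.1, 0.2, 0.7])
audio = np.array([0.3, 0.4, 0.3])
video = np.array([0.2, 0.3, 0.5])
label, fused = late_fusion([text, audio, video])
print(label, fused)  # 2 [0.2 0.3 0.5]
```

Because each unimodal predictor is trained and run independently, this style of baseline pairs naturally with CH-SIMS's independent unimodal annotations.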

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

The empirical results illustrate that the proposed MCMulT model not only outperforms existing approaches on unaligned multimodal sequences but also has strong performance on aligned multimodal sequences.

References

Multimodal sentiment analysis with word-level fusion and reinforcement learning

The Gated Multimodal Embedding LSTM with Temporal Attention model is proposed; composed of two modules, it performs modality fusion at the word level, better models the multimodal structure of speech through time, and achieves stronger sentiment comprehension.

Tensor Fusion Network for Multimodal Sentiment Analysis

A novel model, termed Tensor Fusion Network, is introduced, which learns intra-modality and inter-modality dynamics end-to-end in sentiment analysis and outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
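The core of tensor fusion is a three-way outer product of the unimodal embeddings, each augmented with a constant 1 so that unimodal and bimodal interaction terms survive alongside the trimodal ones. A minimal sketch of that fusion step (the embedding sizes are illustrative assumptions, not the paper's dimensions):

```python
import numpy as np

def tensor_fusion(z_text, z_audio, z_video):
    # Append a constant 1 to each unimodal embedding so the outer
    # product retains unimodal and bimodal interaction terms.
    zt = np.append(z_text, 1.0)
    za = np.append(z_audio, 1.0)
    zv = np.append(z_video, 1.0)
    # Three-way outer product: one coordinate per combination of features.
    fused = np.einsum('i,j,k->ijk', zt, za, zv)
    return fused.ravel()  # flattened for a downstream classifier

# Toy embeddings (sizes are illustrative only).
f = tensor_fusion(np.ones(3), np.ones(2), np.ones(4))
print(f.shape)  # (3+1) * (2+1) * (4+1) = 60 coordinates
```

Note how the fused dimension grows multiplicatively with the unimodal sizes, which is the usual cost cited for this kind of explicit tensor fusion.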

Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis

Two methods for unsupervised learning of joint multimodal representations using sequence-to-sequence (Seq2Seq) methods are proposed: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model.

MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos

This paper introduces to the scientific community the first opinion-level annotated corpus of sentiment and subjectivity analysis in online videos, called the Multimodal Opinion-level Sentiment Intensity dataset (MOSI), which is rigorously annotated with labels for subjectivity and sentiment intensity, per-frame and per-opinion annotated visual features, and per-millisecond annotated audio features.

Utterance-Level Multimodal Sentiment Analysis

It is shown that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.

Multi-attention Recurrent Network for Human Communication Comprehension

The main strength of the model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent part called the Long-short Term Hybrid Memory (LSTHM).

Towards multimodal sentiment analysis: harvesting opinions from the web

This paper addresses the task of multimodal sentiment analysis, and conducts proof-of-concept experiments that demonstrate that a joint model that integrates visual, audio, and textual features can be effectively used to identify sentiment in Web videos.
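A joint model of the kind described above typically begins with early (feature-level) fusion: concatenating per-modality feature vectors into a single representation before classification. A minimal sketch (the feature names and sizes are illustrative assumptions):

```python
import numpy as np

# Early (feature-level) fusion: concatenate per-modality feature
# vectors into one joint representation for a downstream classifier.
visual  = np.random.rand(35)   # e.g. facial-gesture features (size illustrative)
audio   = np.random.rand(28)   # e.g. prosodic features
textual = np.random.rand(300)  # e.g. averaged word embeddings
joint = np.concatenate([visual, audio, textual])
print(joint.shape)  # (363,)
```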

Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis

A novel method to extract features from visual and textual modalities using deep convolutional neural networks is presented, which significantly outperforms the state of the art of multimodal emotion recognition and sentiment analysis on different datasets.

Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages

This article addresses the fundamental question of exploiting the dynamics between visual gestures and verbal messages to better model sentiment, introducing the first multimodal dataset with opinion-level sentiment intensity annotations and proposing a new computational representation, called the multimodal dictionary, based on a language-gesture study.

Select-Additive Learning: Improving Cross-individual Generalization in Multimodal Sentiment Analysis

A Select-Additive Learning (SAL) procedure is proposed that improves the generalizability of trained discriminative neural networks and significantly increases prediction accuracy in all three modalities (text, audio, video), as well as in their fusion.