Multimodal Transformer for Unaligned Multimodal Language Sequences

@inproceedings{tsai2019multimodal,
  title={Multimodal Transformer for Unaligned Multimodal Language Sequences},
  author={Yao-Hung Hubert Tsai and Shaojie Bai and Paul Pu Liang and J. Zico Kolter and Louis-Philippe Morency and Ruslan Salakhutdinov},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2019}
}
Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin.
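The directional pairwise crossmodal attention described above can be sketched in a few lines: the target modality supplies the queries while the source modality supplies the keys and values, so each target time step attends over every source time step. This is a minimal numpy sketch for illustration only; it omits the learned query/key/value projections, multiple heads, and layer stacking that the full model uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(target, source):
    """Directional crossmodal attention (sketch): the target modality
    queries the source modality; keys and values come from the source.
    target: (T_t, d), source: (T_s, d). The learned projection matrices
    W_q, W_k, W_v are omitted here for brevity."""
    d = target.shape[-1]
    scores = target @ source.T / np.sqrt(d)   # (T_t, T_s)
    weights = softmax(scores, axis=-1)        # each target step attends over all source steps
    return weights @ source                   # (T_t, d): source stream adapted to the target

# Toy example: a 4-step language stream attending to a 6-step audio
# stream -- note the two sequences need not be aligned or equal length.
rng = np.random.default_rng(0)
lang = rng.normal(size=(4, 8))
audio = rng.normal(size=(6, 8))
out = crossmodal_attention(lang, audio)
print(out.shape)  # (4, 8) -- one adapted vector per language time step
```

Because attention compares every pair of time steps directly, no word-level alignment between the modalities is required, which is what lets the model handle unaligned sequences.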


Learning Modality-Specific and -Agnostic Representations for Asynchronous Multimodal Language Sequences

A predictive self-attention module captures reliable contextual dependencies and enhances the unique features over the modality-specific spaces, and a double-discriminator strategy ensures that distinct representations are produced in an adversarial manner.

MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences

Modal-Temporal Attention Graph is an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data and achieves state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks, while utilizing significantly fewer model parameters.

MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences

By learning to focus only on the important interactions within the graph, the proposed MTGAT is able to achieve state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks including IEMOCAP and CMU-MOSI, while utilizing significantly fewer computations.

Factorized Multimodal Transformer for Multimodal Sequential Learning

A new transformer model, called the Factorized Multimodal Transformer (FMT), which inherently models the intramodal and intermodal dynamics within its multimodal input in a factorized manner and shows superior performance over previously proposed models.

Integrating Multimodal Information in Large Pretrained Transformers

Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet.

Attention is not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion

The Modality-Invariant Crossmodal Attention (MICA) approach towards learning crossmodal interactions over modality-invariant space in which the distribution mismatch between different modalities is well bridged is proposed.

A novel multimodal dynamic fusion network for disfluency detection in spoken utterances

This paper proposes a novel multimodal architecture for disfluency detection from individual utterances that leverages a multimodal dynamic fusion network, adding minimal parameters over an existing text encoder commonly used in prior art to exploit the prosodic and acoustic cues hidden in speech.

Hierachical Delta-Attention Method for Multimodal Fusion

This work attempts to preserve the long-range dependencies within and across different modalities, which would otherwise be bottlenecked by the use of recurrent networks, and adds the concept of delta-attention to focus on local differences per modality and capture the idiosyncrasies of different people.

Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion

A novel model, termed Multimodal Graph, is proposed to investigate the effectiveness of graph neural networks (GNN) on modeling multimodal sequential data and devise a graph pooling fusion network to automatically learn the associations between various nodes from different modalities.

Interpretable Multimodal Routing for Human Multimodal Language

This paper proposes Multimodal Routing to separate the contributions to the prediction from each modality and the interactions between modalities, and provides both global and local interpretation using this method on sentiment analysis and emotion prediction, without loss of performance compared to state-of-the-art methods.

Learning Factorized Multimodal Representations

A model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors is introduced that demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance.

Multimodal Language Analysis with Recurrent Multistage Fusion

The Recurrent Multistage Fusion Network (RMFN) is proposed which decomposes the fusion problem into multiple stages, each of them focused on a subset of multimodal signals for specialized, effective fusion.

Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities

This paper proposes a method to learn robust joint representations by translating between modalities based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input.

Audio-Visual Fusion for Sentiment Classification using Cross-Modal Autoencoder

This paper proposes a novel model combining deep canonical correlation analysis (DCCA) with cross-modal autoencoders that tries to reconstruct the representations corresponding to the missing modality, using the DCCA transformed representations of the available input modalities.

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance when compared to the previous state of the art.

Multimodal learning with deep Boltzmann machines

A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.

Combining Language and Vision with a Multimodal Skip-gram Model

Since they propagate visual information to all words, the MMSKIP-GRAM models discover intriguing visual properties of abstract words, paving the way to realistic implementations of embodied theories of meaning.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Multimodal Deep Learning

This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
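The core operation the Transformer is built on is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal single-head numpy sketch of that formula, shown here in the self-attention case (Q = K = V = the same token sequence); the learned projections and multi-head split of the full architecture are omitted:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T_q, T_k) similarity of each query to each key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)       # rows sum to 1
    return weights @ V                                # (T_q, d_v) weighted sum of values

# Self-attention over a toy sequence of 5 tokens with model dim 16.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (5, 16)
```

Because every token attends to every other token in one matrix product, the path between any two positions has length one, which is what dispenses with recurrence and convolutions.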