Corpus ID: 225062253

MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences

Authors: Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, Louis-Philippe Morency
Human communication is multimodal in nature: opinions and emotions are expressed through multiple modalities, namely language, voice, and facial expressions. Data in this domain exhibits complex multi-relational and temporal interactions, and learning from it is a fundamentally challenging research problem. In this paper, we propose Multimodal Temporal Graph Attention Networks (MTGAT). MTGAT is an interpretable graph-based neural model that provides a suitable framework for… 
Multimodal Graph for Unaligned Multimodal Sequence Analysis via Graph Convolution and Graph Pooling
This work devises graph pooling algorithms that automatically explore the associations between time slices from different modalities and learn high-level graph representations hierarchically; it outperforms state-of-the-art models on three datasets under the same experimental setting.
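The paper's exact pooling algorithm is not reproduced here; as a general illustration of the graph pooling idea it builds on, a top-k pooling step (in the style of Graph U-Nets) can be sketched as follows. All names are illustrative, and the score-based gating stands in for a learned scoring function:

```python
import numpy as np

def topk_graph_pool(node_features, adjacency, scores, k):
    """Keep the k highest-scoring nodes and the induced subgraph.
    Kept features are gated by the sigmoid of their scores, so that
    in a learned setting the scoring function receives gradient."""
    idx = np.argsort(scores)[-k:]                 # indices of the top-k nodes
    gate = 1.0 / (1.0 + np.exp(-scores[idx]))     # sigmoid gate per kept node
    pooled_x = node_features[idx] * gate[:, None] # gated node features
    pooled_adj = adjacency[np.ix_(idx, idx)]      # induced subgraph adjacency
    return pooled_x, pooled_adj
```

Stacking such pooling steps yields progressively coarser graphs, which is the hierarchical representation idea the summary refers to.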
Graph Capsule Aggregation for Unaligned Multimodal Sequences
This paper introduces Graph Capsule Aggregation (GraphCAGE), which models unaligned multimodal sequences with a graph-based neural model and a Capsule Network; GraphCAGE achieves state-of-the-art performance on two benchmark datasets, with representations refined by the Capsule Network and interpretations provided.
Temporal Graph Convolutional Network for Multimodal Sentiment Analysis
This paper encodes the positions of segments within utterances into their features using positional encodings built from interleaved sine and cosine embeddings, and creates a segment-level attention mechanism that captures sentiment-related segments to obtain unified utterance embeddings.
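The interleaved sine/cosine positional encoding mentioned above follows the standard Transformer formulation. A minimal NumPy sketch (function name illustrative; assumes an even embedding dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, dim):
    """Transformer-style positional encoding: sine on even channels,
    cosine on odd channels, with geometrically spaced frequencies."""
    positions = np.arange(num_positions)[:, None]                   # (P, 1)
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim)) # (dim/2,)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)  # even dims: sine
    pe[:, 1::2] = np.cos(positions * freqs)  # odd dims: cosine
    return pe
```

Each position gets a unique pattern, and nearby positions get similar encodings, which is what lets attention layers reason about segment order.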
TEASEL: A Transformer-Based Speech-Prefixed Language Model
This work proposes TEASEL, a Transformer-Based Speech-Prefixed Language Model, to address the mentioned constraints without training a complete Transformer model; TEASEL is 72% smaller than the SoTA model.


Multimodal Transformer for Unaligned Multimodal Language Sequences
Comprehensive experiments on both aligned and unaligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that the proposed crossmodal attention mechanism in MulT captures correlated crossmodal signals.
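The crossmodal attention described above lets one modality attend directly over the unaligned time steps of another: queries come from the target modality, keys and values from the source. A simplified sketch (learned projection matrices W_q, W_k, W_v omitted; names illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(target_seq, source_seq, d_k):
    """Target modality (e.g. language) queries source modality (e.g. audio).
    Sequences may have different lengths, so no pre-alignment is needed."""
    scores = target_seq @ source_seq.T / np.sqrt(d_k)  # (T_tgt, T_src)
    weights = softmax(scores, axis=-1)                 # attend over source steps
    return weights @ source_seq                        # (T_tgt, d)
```

Because the attention matrix is (T_tgt, T_src), the two sequences need not be word-aligned, which is the point of applying this to unaligned data.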
Multimodal Language Analysis with Recurrent Multistage Fusion
The Recurrent Multistage Fusion Network (RMFN) is proposed, which decomposes the fusion problem into multiple stages, each focused on a subset of multimodal signals for specialized, effective fusion.
Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment
This work introduces a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data and demonstrates that the model outperforms state-of-the-art approaches on published datasets.
Learning Factorized Multimodal Representations
This work introduces a model that factorizes representations into two sets of independent factors: multimodal discriminative factors and modality-specific generative factors. The model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance.
Multimodal sentiment analysis with word-level fusion and reinforcement learning
The Gated Multimodal Embedding LSTM with Temporal Attention model is proposed; composed of two modules, it performs modality fusion at the word level, better models the multimodal structure of speech through time, and achieves better sentiment comprehension.
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation
A novel graph-based multi-modal fusion encoder for NMT that captures various semantic relationships between multi-modal semantic units (words and visual objects) and provides an attention-based context vector for the decoder.
Multimodal Neural Graph Memory Networks for Visual Question Answering
A new neural network architecture, Multimodal Neural Graph Memory Networks (MN-GMN), for visual question answering that rivals state-of-the-art models on the Visual7W, VQA-v2.0, and CLEVR datasets.
Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities
This paper proposes learning robust joint representations by translating between modalities, based on the key insight that translating from a source to a target modality provides a way to learn joint representations using only the source modality as input.
Audio-Visual Fusion for Sentiment Classification using Cross-Modal Autoencoder
This paper proposes a novel model combining deep canonical correlation analysis (DCCA) with cross-modal autoencoders that tries to reconstruct the representations corresponding to the missing modality, using the DCCA transformed representations of the available input modalities.
Recurrent Space-time Graph Neural Networks
This work proposes a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within a changing scene; it obtains state-of-the-art performance on the challenging Something-Something human-object interaction dataset.