Integrating Multimodal Information in Large Pretrained Transformers

  title={Integrating Multimodal Information in Large Pretrained Transformers},
  author={Wasifur Rahman and M. Hasan and Sangwu Lee and Amir Zadeh and Chengfeng Mao and Louis-Philippe Morency and Ehsan Hoque},
  journal={Proceedings of the conference. Association for Computational Linguistics. Meeting},
  • Wasifur Rahman, M. Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, Ehsan Hoque
  • Published 1 July 2020
  • Computer Science
  • Proceedings of the conference. Association for Computational Linguistics. Meeting
Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused on… 

Top-Down Attention in End-to-End Spoken Language Understanding

  • Yixin Chen, Weiyi Lu, Belinda Zeng
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This paper proposes Top-Down SLU (TD-SLU), a new transformer-based E2E SLU model that uses top-down attention and an attention gate to fuse high-level NLU features with low-level ASR features, which leads to better optimization of both tasks.

CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis

The Cross-Modal BERT (CM-BERT), which relies on the interaction of the text and audio modalities to fine-tune the pre-trained BERT model, is proposed; it significantly improves performance on all metrics over previous baselines and text-only fine-tuning of BERT.

Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition

This work learns multimodal representations using a transformer trained on the masked language modeling task with audio, visual, and text features; these representations can improve emotion recognition performance by up to 3% compared to the baseline.

Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis

A generation module based on the self-supervised learning strategy is proposed to acquire independent unimodal supervisions, together with a weight-adjustment strategy to balance the learning progress among different subtasks, with the goal of validating the reliability and stability of the auto-generated unimodal supervisions.

Humor Knowledge Enriched Transformer for Understanding Multimodal Humor

This paper proposes the Humor Knowledge enriched Transformer, which can capture the gist of a multimodal humorous expression by integrating the preceding context and external knowledge; it incorporates humor-centric external knowledge into the model by capturing the ambiguity and sentiment present in the language.

Span-based Localizing Network for Natural Language Video Localization

This work proposes a video span localizing network (VSLNet), built on top of the standard span-based QA framework, to address NLVL, and tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy.

Training Strategies to Handle Missing Modalities for Audio-Visual Expression Recognition

Results on in-the-wild data indicate significant generalization in the proposed models trained on missing cues, with gains of up to 17% for frame-level ablations, showing that these training strategies cope better with the loss of input modalities.

Weakly-supervised Multi-task Learning for Multimodal Affect Recognition

This paper explores three multimodal affect recognition tasks: 1) emotion recognition; 2) sentiment analysis; and 3) sarcasm recognition. It suggests that weak supervision can provide a contribution comparable to strong supervision if the tasks are highly correlated.

Detecting Expressions with Multimodal Transformers

This study investigates deep-learning algorithms for audio-visual detection of a user’s expression, reports significant improvements over models trained on single modalities, and proposes a transformer architecture with encoder layers that better integrate audio-visual features for expression tracking.

Multimodal End-to-End Sparse Model for Emotion Recognition

This paper develops a fully end-to-end model that connects the two phases and optimizes them jointly, and introduces a sparse cross-modal attention mechanism for the feature extraction.



Multimodal Transformer for Unaligned Multimodal Language Sequences

Comprehensive experiments on both aligned and non-aligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that correlated crossmodal signals can be captured by the proposed crossmodal attention mechanism in MulT.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

VideoBERT: A Joint Model for Video and Language Representation Learning

This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.

Improving Language Understanding by Generative Pre-Training

The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, improving upon the state of the art in 9 out of the 12 tasks studied.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities

This paper proposes a method to learn robust joint representations by translating between modalities based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input.

Learning Factorized Multimodal Representations

A model is introduced that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. It demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance.

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.

Multi-attention Recurrent Network for Human Communication Comprehension

The main strength of the model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent part called the Long-short Term Hybrid Memory (LSTHM).

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.