Corpus ID: 237492099

TEASEL: A Transformer-Based Speech-Prefixed Language Model

Mehdi Arjmand, Mohammad Javad Dousti, Hadi Moradi
Multimodal language analysis is a burgeoning field of NLP that aims to simultaneously model a speaker’s words, acoustical annotations, and facial expressions. In this area, lexicon features usually outperform other modalities because they are pre-trained on large corpora via Transformer-based models. Despite their strong performance, training a new self-supervised learning (SSL) Transformer on any modality is not usually attainable due to insufficient data, which is the case in multimodal… 
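The "speech-prefixed" idea in the title can be illustrated with a minimal sketch: frame-level speech features are projected into the language model's embedding space and prepended to the token embeddings as a prefix. This is my own simplification under assumed shapes; TEASEL's actual architecture (which speech encoder, which language model, how the projection is trained) is not given in this excerpt.

```python
import numpy as np

def speech_prefix(speech_feats, token_embs, W, b):
    """Project frame-level speech features to the LM embedding size
    and prepend them as a 'prefix' to the token embeddings.

    speech_feats: (n_frames, d_speech) acoustic features
    token_embs:   (n_tokens, d_model) word embeddings from the LM
    W, b:         learned projection (d_speech -> d_model)
    """
    prefix = speech_feats @ W + b          # (n_frames, d_model)
    return np.concatenate([prefix, token_embs], axis=0)

rng = np.random.default_rng(0)
out = speech_prefix(rng.normal(size=(50, 512)),   # 50 speech frames
                    rng.normal(size=(20, 768)),   # 20 subword tokens
                    rng.normal(size=(512, 768)) * 0.01,
                    np.zeros(768))
print(out.shape)  # (70, 768)
```

The combined sequence can then be fed to the Transformer, which attends jointly over the speech prefix and the text.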


Multi-Modal Sentiment Analysis Based on Interactive Attention Mechanism

An optimized BERT model composed of three modules: a Hierarchical Multi-head Self-Attention module that performs hierarchical feature extraction; a Gate Channel module that replaces BERT's original feed-forward layer to filter information; and a self-attention-based tensor fusion model that fuses the features of the different modalities.
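The tensor fusion step mentioned here builds on the classic outer-product formulation (as in Tensor Fusion Networks): each modality vector is augmented with a constant 1 so that the outer product retains unimodal and bimodal interactions alongside the trimodal ones. A minimal sketch, not the paper's self-attention variant:

```python
import numpy as np

def tensor_fusion(z_l, z_a, z_v):
    """Outer-product tensor fusion of language, acoustic, and visual
    vectors. Appending 1 to each vector keeps the original unimodal
    and pairwise terms as slices of the fused tensor."""
    zl = np.append(z_l, 1.0)
    za = np.append(z_a, 1.0)
    zv = np.append(z_v, 1.0)
    return np.einsum('i,j,k->ijk', zl, za, zv)  # (dl+1, da+1, dv+1)

fused = tensor_fusion(np.ones(4), np.ones(3), np.ones(2))
print(fused.shape)  # (5, 4, 3)
```

The fused tensor is typically flattened and passed through dense layers for prediction.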

A Optimized BERT for Multimodal Sentiment Analysis

A Hierarchical Multi-head Self-Attention and Gate Channel BERT, an optimized BERT model that achieves promising results and improves accuracy by 5-6% over traditional models on the CMU-MOSI dataset.

Learning modality-fused representation based on transformer for emotion analysis

This work employs a widely used multimodal factorized high-order pooling mechanism to obtain a more discriminative feature representation, and shows superior performance in both word-aligned and unaligned settings.

Integrating Multimodal Information in Large Pretrained Transformers

Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet.
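The Multimodal Adaptation Gate (MAG) behind these models shifts each word's language embedding by a gated combination of its nonverbal features, with the shift's norm capped relative to the embedding's norm. A rough, un-batched sketch of that idea (my simplification; the actual MAG also applies layer normalization and dropout inside the Transformer):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mag(z, a, v, Wga, Wgv, Wa, Wv, beta=1.0, eps=1e-6):
    """Adaptation-gate sketch: shift a word's language embedding z
    by a gated mix of its acoustic (a) and visual (v) features,
    scaling the shift so it cannot dominate ||z||."""
    g_a = relu(Wga @ np.concatenate([z, a]))   # acoustic gate
    g_v = relu(Wgv @ np.concatenate([z, v]))   # visual gate
    h = g_a * (Wa @ a) + g_v * (Wv @ v)        # nonverbal displacement
    alpha = min(np.linalg.norm(z) / (np.linalg.norm(h) + eps) * beta, 1.0)
    return z + alpha * h

rng = np.random.default_rng(1)
d, da, dv = 8, 5, 6
z_shifted = mag(rng.normal(size=d), rng.normal(size=da), rng.normal(size=dv),
                Wga=rng.normal(size=(d, d + da)),
                Wgv=rng.normal(size=(d, d + dv)),
                Wa=rng.normal(size=(d, da)),
                Wv=rng.normal(size=(d, dv)))
print(z_shifted.shape)  # (8,)
```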

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance compared to the previous state of the art.

Multimodal Transformer for Unaligned Multimodal Language Sequences

Comprehensive experiments on both aligned and non-aligned multimodal time series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that the proposed crossmodal attention mechanism in MulT captures correlated crossmodal signals.
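The crossmodal attention at MulT's core lets a target modality query a source modality directly, so no word-level alignment between the two sequences is needed. A single-head, un-batched sketch (MulT stacks multi-head versions of this inside full Transformer layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(x_tgt, x_src, Wq, Wk, Wv):
    """One crossmodal attention head: queries come from the target
    modality, keys/values from the source, so every target step
    attends over all source steps regardless of sequence lengths."""
    Q = x_tgt @ Wq                       # (T_tgt, d)
    K = x_src @ Wk                       # (T_src, d)
    V = x_src @ Wv                       # (T_src, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V  # (T_tgt, d)

rng = np.random.default_rng(2)
text = rng.normal(size=(20, 300))        # 20 text steps
audio = rng.normal(size=(60, 74))        # 60 audio frames (unaligned)
out = crossmodal_attention(text, audio,
                           Wq=rng.normal(size=(300, 40)) * 0.05,
                           Wk=rng.normal(size=(74, 40)) * 0.05,
                           Wv=rng.normal(size=(74, 40)) * 0.05)
print(out.shape)  # (20, 40)
```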

Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition

This work learns multimodal representations using a Transformer trained on the masked language modeling task with audio, visual, and text features, which improves emotion recognition performance by up to 3% over the baseline.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Multimodal sentiment analysis with word-level fusion and reinforcement learning

The Gated Multimodal Embedding LSTM with Temporal Attention model is proposed; composed of two modules, it performs modality fusion at the word level, better models the multimodal structure of speech over time, and achieves better sentiment comprehension.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
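The heart of wav2vec 2.0's self-supervision is a contrastive objective: at each masked time step, the model must identify the true quantized target among distractors using its context vector. An illustrative per-step sketch only (the full objective also adds a codebook-diversity term and operates over batches of masked steps):

```python
import numpy as np

def contrastive_loss(c, q_pos, q_negs, kappa=0.1):
    """wav2vec 2.0-style contrastive term for one masked step:
    -log softmax over cosine similarities between the context
    vector c and the true target q_pos vs. distractors q_negs."""
    def cos(u, w):
        return u @ w / (np.linalg.norm(u) * np.linalg.norm(w))
    sims = np.array([cos(c, q_pos)] + [cos(c, q) for q in q_negs]) / kappa
    return -sims[0] + np.log(np.sum(np.exp(sims)))  # -log p(positive)

rng = np.random.default_rng(3)
loss = contrastive_loss(rng.normal(size=256),
                        rng.normal(size=256),
                        [rng.normal(size=256) for _ in range(10)])
print(loss > 0)  # True: log-sum-exp always exceeds the positive logit
```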

MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences

By learning to focus only on the important interactions within the graph, the proposed MTGAT is able to achieve state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks including IEMOCAP and CMU-MOSI, while utilizing significantly fewer computations.

Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities

This paper proposes a method to learn robust joint representations by translating between modalities based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input.

Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

A novel model, the Interaction Canonical Correlation Network (ICCN), is proposed, which learns correlations between all three modalities via deep canonical correlation analysis (DCCA); the resulting embeddings are tested on several benchmark datasets against other state-of-the-art multimodal embedding algorithms.