• Corpus ID: 238253096

Neural Dependency Coding inspired Multimodal Fusion

  • Shiv Shankar
  • Published 28 September 2021
  • Computer Science
  • ArXiv
Information integration from different modalities is an active area of research. Human beings and, in general, biological neural systems are quite adept at using a multitude of signals from different sensory perceptive fields to interact with the environment and each other. Recent work in deep fusion models via neural networks has led to substantial improvements over unimodal approaches in areas like speech recognition, emotion recognition and analysis, captioning and image description. However… 

Progressive Fusion for Multimodal Integration

Progressive Fusion is presented, a model-agnostic technique that makes late-stage fused representations available to early layers through backward connections, improving the expressiveness of the representations.



Multi-attention Recurrent Network for Human Communication Comprehension

The main strength of the model comes from discovering interactions between modalities through time using a neural component called the Multi-attention Block (MAB) and storing them in the hybrid memory of a recurrent part called the Long-short Term Hybrid Memory (LSTHM).

Efficient Low-rank Multimodal Fusion With Modality-Specific Factors

The Low-rank Multimodal Fusion method is proposed, which performs multimodal fusion using low-rank tensors to improve efficiency and is indeed much more efficient in both training and inference compared to other methods that utilize tensor representations.

Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

This paper explores the use of modality-specific "BERT-like" pretrained Self Supervised Learning (SSL) architectures to represent both speech and text modalities for the task of multimodal speech emotion recognition and demonstrates that a simple fusion mechanism can outperform more complex ones.

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

A novel framework, MISA, is proposed, which projects each modality into two distinct subspaces, one modality-invariant and one modality-specific. Together these provide a holistic view of the multimodal data, which is then fused to produce task predictions.

Tensor Fusion Network for Multimodal Sentiment Analysis

A novel model, termed Tensor Fusion Networks, is introduced, which learns intra-modality and inter-modality dynamics end-to-end in sentiment analysis and outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.

Factorized Multimodal Transformer for Multimodal Sequential Learning

A new transformer model, called the Factorized Multimodal Transformer (FMT), is proposed, which inherently models the intramodal and intermodal dynamics within its multimodal input in a factorized manner and shows superior performance over previously proposed models.

Multimodal Machine Learning: A Survey and Taxonomy

This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.

Integrating Multimodal Information in Large Pretrained Transformers

Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet.

Multimodal Transformer for Unaligned Multimodal Language Sequences

Comprehensive experiments on both aligned and non-aligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that correlated crossmodal signals can be captured by the proposed crossmodal attention mechanism in MulT.

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance when compared to the previous state of the art.