Multimodal Machine Learning: A Survey and Taxonomy

@article{Baltruaitis2019MultimodalML,
  title={Multimodal Machine Learning: A Survey and Taxonomy},
  author={Tadas Baltru{\vs}aitis and Chaitanya Ahuja and Louis-Philippe Morency},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2019},
  volume={41},
  pages={423-443}
}
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. This paper surveys recent advances in multimodal machine learning itself, rather than specific applications, and presents them in a common taxonomy that goes beyond the typical early/late fusion categorization. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

Citations

Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

TLDR
This paper proposes a taxonomy of six core technical challenges (representation, alignment, reasoning, generation, transference, and quantification) covering historical and recent trends, and defines two key principles, modality heterogeneity and interconnections, that have driven subsequent innovations.

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

TLDR
An improved understanding of the key concepts and algorithms of deep multimodal learning is sought for the computer vision community by exploring how to build deep models that integrate and combine heterogeneous visual cues across sensory modalities.

Multimodal Conversational AI: A Survey of Datasets and Approaches

TLDR
This paper motivates, defines, and mathematically formulates the multimodal conversational research objective, and provides a taxonomy of research required to solve the objective: multi-modality representation, fusion, alignment, translation, and co-learning.

Remote Sensing and Time Series Data Fused Multimodal Prediction Model Based on Interaction Analysis

TLDR
A scalable interaction model based on the Squeeze-and-Excitation block is proposed to fuse the image modality from remote sensing images with the temporal modality from user visit sequences, solving the practical problem of urban functional area classification.
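
To make that fusion scheme concrete, here is a minimal PyTorch sketch of a squeeze-and-excitation style gating block over concatenated image and visit-sequence features; the class name SEFusion, the dimensions, and the architecture are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch (not the paper's exact model): fuse an image-modality feature
# vector and a visit-sequence (temporal) feature vector with a
# squeeze-and-excitation style gating block. All names are illustrative.
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    def __init__(self, img_dim: int, seq_dim: int, reduction: int = 4):
        super().__init__()
        fused_dim = img_dim + seq_dim
        # "Squeeze" the concatenated features, then "excite" with per-channel
        # weights in [0, 1] that rescale each feature before classification.
        self.excite = nn.Sequential(
            nn.Linear(fused_dim, fused_dim // reduction),
            nn.ReLU(),
            nn.Linear(fused_dim // reduction, fused_dim),
            nn.Sigmoid(),
        )

    def forward(self, img_feat: torch.Tensor, seq_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([img_feat, seq_feat], dim=-1)   # (batch, img_dim + seq_dim)
        weights = self.excite(fused)                       # channel-wise gates
        return fused * weights                             # re-weighted joint feature

# Example: 512-d CNN features from remote-sensing images, 128-d RNN features
# from user-visit sequences, fused for a downstream functional-area classifier.
fusion = SEFusion(img_dim=512, seq_dim=128)
img, seq = torch.randn(8, 512), torch.randn(8, 128)
print(fusion(img, seq).shape)  # torch.Size([8, 640])
```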

Exploring modality-agnostic representations for music classification

TLDR
This work explores the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality.

Multimodal Data Fusion with Quantum Inspiration

TLDR
This research on quantum-inspired multimodal data fusion proposes to employ superposition to formulate intra-modal interactions, while the interplay between different modalities is captured by entanglement measures.
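
As a toy numerical illustration of the general idea (not the paper's formulation), the sketch below treats each modality as a unit-norm amplitude vector, forms a joint bipartite state over two modalities, and quantifies their interplay with the entanglement entropy of the reduced density matrix.

```python
# Toy sketch: each modality is a "superposition", i.e. a unit-norm amplitude
# vector over basis states; a joint bipartite state is built over two
# modalities, and cross-modal interplay is measured by entanglement entropy.
import numpy as np

def entanglement_entropy(joint_state: np.ndarray) -> float:
    """Von Neumann entropy of modality A after tracing out modality B."""
    # joint_state has shape (dim_A, dim_B) and unit Frobenius norm.
    rho_a = joint_state @ joint_state.conj().T          # reduced density matrix
    eigvals = np.linalg.eigvalsh(rho_a)
    eigvals = eigvals[eigvals > 1e-12]
    return float(-np.sum(eigvals * np.log2(eigvals)))

# A product (separable) joint state carries no cross-modal correlation ...
text_amp = np.array([1.0, 0.0])
image_amp = np.array([np.sqrt(0.5), np.sqrt(0.5)])
print(entanglement_entropy(np.outer(text_amp, image_amp)))           # ~0.0

# ... while a maximally correlated joint state is maximally "entangled".
correlated = np.array([[np.sqrt(0.5), 0.0], [0.0, np.sqrt(0.5)]])
print(entanglement_entropy(correlated))                               # ~1.0
```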

What is Multimodality?

TLDR
A new task-relative definition of (multi)modality is proposed for multimodal machine learning, focusing on the representations and information that are relevant for a given machine learning task.

Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

TLDR
This work focuses on unsupervised feature learning for Multimodal Emotion Recognition (MER) with discrete emotions, using text, audio, and vision as modalities, which is the first such attempt in the MER literature.
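
A minimal sketch of such a modality-pairwise objective is given below, assuming placeholder linear encoders for text, audio, and vision and an InfoNCE-style term summed over every modality pair; the dimensions and encoders are illustrative, not the paper's architecture.

```python
# Minimal sketch of a modality-pairwise contrastive objective over text, audio
# and vision features from the same clips; no emotion labels are used.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

encoders = nn.ModuleDict({
    "text": nn.Linear(300, 64),    # placeholder encoders, illustrative dims
    "audio": nn.Linear(128, 64),
    "vision": nn.Linear(512, 64),
})

def pairwise_contrastive_loss(batch: dict, temperature: float = 0.1) -> torch.Tensor:
    """Sum an InfoNCE term over every unordered pair of modalities."""
    z = {m: F.normalize(encoders[m](x), dim=-1) for m, x in batch.items()}
    targets = torch.arange(next(iter(z.values())).shape[0])
    total = 0.0
    for m1, m2 in itertools.combinations(z, 2):   # text-audio, text-vision, audio-vision
        logits = z[m1] @ z[m2].T / temperature    # matching clips on the diagonal
        total = total + F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)
    return total

batch = {"text": torch.randn(16, 300), "audio": torch.randn(16, 128), "vision": torch.randn(16, 512)}
pairwise_contrastive_loss(batch).backward()
```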
...

References

Showing 1-10 of 267 references

Multimodal learning with deep Boltzmann machines

TLDR
A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.

EmoNets: Multimodal deep learning approaches for emotion recognition in video

TLDR
This paper explores multiple methods for combining cues from multiple modalities into one common classifier, which achieves considerably greater accuracy than the strongest single-modality classifier.

Multimodal Deep Learning

TLDR
This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross-modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature-learning time.
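
The sketch below illustrates cross-modality feature learning with a shared-representation autoencoder in PyTorch: both modalities are encoded into one joint code that must reconstruct both, even when only one modality is presented. The dimensions and layer choices are illustrative assumptions rather than the original network.

```python
# Minimal sketch of a shared-representation (bimodal) autoencoder: the joint
# code is trained to reconstruct both modalities even when one input is zeroed,
# encouraging cross-modality feature learning. Architecture is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, shared_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(audio_dim + video_dim, shared_dim), nn.ReLU())
        self.decode_audio = nn.Linear(shared_dim, audio_dim)
        self.decode_video = nn.Linear(shared_dim, video_dim)

    def forward(self, audio, video):
        z = self.encoder(torch.cat([audio, video], dim=-1))   # shared multimodal code
        return self.decode_audio(z), self.decode_video(z)

model = BimodalAutoencoder()
audio, video = torch.randn(8, 100), torch.randn(8, 300)
# Feed video only (audio zeroed) but still reconstruct both modalities.
audio_hat, video_hat = model(torch.zeros_like(audio), video)
loss = F.mse_loss(audio_hat, audio) + F.mse_loss(video_hat, video)
loss.backward()
```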

Multimodal Transfer Deep Learning for Audio Visual Recognition

We propose a multimodal deep learning framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. ...

Multimodal Dynamic Networks for Gesture Recognition

TLDR
It is demonstrated that multimodal feature learning extracts semantically meaningful shared representations that outperform individual modalities, and that the early fusion scheme is effective compared with the traditional method of late fusion.
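
To make the early-versus-late distinction concrete, here is a minimal scikit-learn sketch on synthetic data: early fusion trains one classifier on the concatenated features, while late fusion trains one classifier per modality and averages their predicted probabilities. The feature names and dimensions are illustrative.

```python
# Minimal sketch contrasting early fusion (concatenate features, one classifier)
# with late fusion (one classifier per modality, predictions combined afterwards).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
skeleton = rng.normal(size=(200, 30)) + y[:, None]   # e.g. skeleton-pose features
depth = rng.normal(size=(200, 50)) + y[:, None]      # e.g. depth-map features

# Early fusion: a single model is trained on the concatenated feature vector.
early = LogisticRegression(max_iter=1000).fit(np.hstack([skeleton, depth]), y)

# Late fusion: independent per-modality models, averaged at decision time.
clf_s = LogisticRegression(max_iter=1000).fit(skeleton, y)
clf_d = LogisticRegression(max_iter=1000).fit(depth, y)
late_proba = (clf_s.predict_proba(skeleton) + clf_d.predict_proba(depth)) / 2
late_pred = late_proba.argmax(axis=1)

print(early.score(np.hstack([skeleton, depth]), y), (late_pred == y).mean())
```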

Co-Adaptation of audio-visual speech and gesture classifiers

TLDR
It is demonstrated that multimodal co-training can be used to learn from only a few labeled examples in one or both of the audio-visual modalities, and a co-adaptation algorithm is proposed, which adapts existing audio-visual classifiers to a particular user or noise condition by leveraging the redundancy in the unlabeled data.
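
The sketch below illustrates the general co-training idea on synthetic data, assuming two modality "views" (audio and visual) of the same examples: each per-view classifier pseudo-labels the unlabeled example it is most confident about and hands it to the other. It is a simplified illustration, not the paper's co-adaptation algorithm.

```python
# Minimal co-training sketch: two classifiers, one per modality view, grow a
# small labeled set by exchanging confident pseudo-labels on unlabeled data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y_all = rng.integers(0, 2, size=500)
audio = rng.normal(size=(500, 20)) + y_all[:, None]
visual = rng.normal(size=(500, 20)) + y_all[:, None]

# Start from only a handful of labeled examples (both classes represented).
labeled = [int(i) for i in np.where(y_all == 0)[0][:5]] + \
          [int(i) for i in np.where(y_all == 1)[0][:5]]
y_lab = {i: int(y_all[i]) for i in labeled}

for _ in range(5):                                     # a few co-training rounds
    idx = np.array(sorted(y_lab))
    y_cur = np.array([y_lab[i] for i in idx])
    clf_a = LogisticRegression(max_iter=1000).fit(audio[idx], y_cur)
    clf_v = LogisticRegression(max_iter=1000).fit(visual[idx], y_cur)
    # Each view pseudo-labels its most confident unlabeled example.
    for clf, feats in ((clf_a, audio), (clf_v, visual)):
        pool = np.array([i for i in range(500) if i not in y_lab])
        proba = clf.predict_proba(feats[pool])[:, 1]
        k = int(np.abs(proba - 0.5).argmax())
        y_lab[int(pool[k])] = int(proba[k] > 0.5)

print(clf_a.score(audio, y_all), clf_v.score(visual, y_all))
```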

Multimodal fusion for multimedia analysis: a survey

This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks.

On the Integration of Grounding Language and Learning Objects

TLDR
A multimodal learning system is presented that can ground spoken names of objects in their physical referents and learn to recognize those objects simultaneously from naturally co-occurring multisensory input, incorporating the spatio-temporal and cross-modal constraints of the multimodal data.

Learning Representations for Multimodal Data with Deep Belief Nets

TLDR
The experimental results on bi-modal data consisting of images and text show that the Multimodal DBN can learn a good generative model of the joint space of image and text inputs that is useful for filling in missing data, so it can be used both for image annotation and image retrieval.

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

TLDR
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
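
The vector-space arithmetic mentioned above can be sketched as follows; the image encoder and word embeddings here are random stand-ins for trained ones, so only the query mechanics (embed, add/subtract, nearest-neighbour search) are illustrated, not the learned regularity.

```python
# Toy sketch of multimodal vector arithmetic in a joint visual-semantic space:
# query = embed(image of blue car) - embed("blue") + embed("red"), answered by
# nearest-neighbour search over candidate image embeddings. Encoders below are
# random placeholders standing in for trained ones.
import numpy as np

rng = np.random.default_rng(0)
W_img = rng.normal(size=(2048, 300))                            # placeholder image projection
word_vecs = {w: rng.normal(size=300) for w in ["blue", "red"]}  # placeholder word embeddings

def embed_image(feat: np.ndarray) -> np.ndarray:
    v = feat @ W_img
    return v / np.linalg.norm(v)

blue_car = rng.normal(size=2048)   # stand-in for CNN features of a blue-car image
candidates = {name: embed_image(rng.normal(size=2048)) for name in ["red_car", "dog", "boat"]}

query = embed_image(blue_car) - word_vecs["blue"] + word_vecs["red"]
query /= np.linalg.norm(query)
best = max(candidates, key=lambda n: candidates[n] @ query)     # cosine nearest neighbour
print(best)
```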
...