Multimodal Machine Learning: A Survey and Taxonomy

@article{baltrusaitis2019multimodal,
  title={Multimodal Machine Learning: A Survey and Taxonomy},
  author={Tadas Baltru{\vs}aitis and Chaitanya Ahuja and Louis-Philippe Morency},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2019}
}
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.


Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

This paper proposes a taxonomy of six core technical challenges (representation, alignment, reasoning, generation, transference, and quantification) covering historical and recent trends, and defines two key principles, modality heterogeneity and modality interconnections, that have driven subsequent innovations.

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

This survey seeks to improve the computer vision community's understanding of the key concepts and algorithms of deep multimodal learning by exploring how to build deep models that integrate and combine heterogeneous visual cues across sensory modalities.

Multimodal Conversational AI: A Survey of Datasets and Approaches

This paper motivates, defines, and mathematically formulates the multimodal conversational research objective, and provides a taxonomy of research required to solve the objective: multi-modality representation, fusion, alignment, translation, and co-learning.

Remote Sensing and Time Series Data Fused Multimodal Prediction Model Based on Interaction Analysis

A scalable interaction model based on the Squeeze-and-Excitation block is proposed to fuse the image modality from remote sensing images with the temporal modality from user visit sequences, solving a practical problem of urban functional area classification.
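The Squeeze-and-Excitation idea behind that fusion can be sketched in miniature: squeeze each modality to a summary statistic, turn the summaries into gates, and rescale the features before combining them. This is an illustrative toy, not the paper's model; the scalar weights `w_img` and `w_tmp` stand in for the learned fully connected layers of a real SE block.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_fuse(image_feats, temporal_feats, w_img=1.0, w_tmp=1.0):
    """Toy SE-style fusion of two modality feature vectors.

    Squeeze: summarize each modality by its mean activation.
    Excite:  map the summaries to per-modality gates in (0, 1) via a sigmoid
             (w_img / w_tmp are placeholders for learned FC layers).
    Scale:   reweight each modality's features, then concatenate.
    """
    s_img = sum(image_feats) / len(image_feats)        # squeeze
    s_tmp = sum(temporal_feats) / len(temporal_feats)
    g_img = sigmoid(w_img * s_img)                     # excite
    g_tmp = sigmoid(w_tmp * s_tmp)
    return [g_img * v for v in image_feats] + [g_tmp * v for v in temporal_feats]

fused = se_fuse([0.2, 0.8], [1.0, -0.5])
```

In a trained model the gates let the network suppress an uninformative modality (e.g. cloudy imagery) while amplifying the other.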

Exploring modality-agnostic representations for music classification

This work explores the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality.
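The retrieval pretext task above amounts to ranking items from one modality by their similarity to a query from another modality in a shared embedding space. A minimal sketch, assuming embeddings have already been projected into a common space (the vectors here are hypothetical toy values):

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors in the shared space
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, candidates):
    """Cross-modal retrieval: index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

# toy audio embedding queried against two text embeddings
best = retrieve([0.9, 0.1], [[0.0, 1.0], [1.0, 0.0]])  # -> 1
```

Training the encoders so that matching cross-modal pairs rank first is what makes the resulting representations modality-agnostic.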

Multimodal Data Fusion with Quantum Inspiration

This research on quantum-inspired multimodal data fusion proposes to employ superposition to formulate intra-modal interactions, while the interplay between different modalities is captured by entanglement measures.

What is Multimodality?

This paper proposes a new task-relative definition of (multi)modality for multimodal machine learning, focusing on the representations and information that are relevant for a given machine learning task.

Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

This work focuses on unsupervised feature learning for Multimodal Emotion Recognition (MER) with discrete emotions, using text, audio, and vision as modalities, which is the first such attempt in the MER literature.
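A modality-pairwise contrastive loss of the kind named above pulls together embeddings of the same sample from two modalities and pushes apart mismatched pairs. This is a generic hinge-on-cosine sketch under that assumption, not the paper's exact loss:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pairwise_contrastive_loss(mod_a, mod_b, margin=0.5):
    """Contrastive loss between two modalities' embeddings.

    mod_a[i] and mod_b[i] encode the same sample (positive pair);
    mod_a[i] and mod_b[j], i != j, are negatives pushed below the margin.
    """
    n = len(mod_a)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            sim = cosine(mod_a[i], mod_b[j])
            if i == j:
                loss += 1.0 - sim                  # positive: drive sim toward 1
            else:
                loss += max(0.0, sim - margin)     # negative: hinge at the margin
    return loss / (n * n)

# perfectly aligned toy embeddings give zero loss
aligned = pairwise_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])  # -> 0.0
```

With three modalities (text, audio, vision), the same loss would be summed over each of the three modality pairs.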



Multimodal learning with deep Boltzmann machines

A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.

EmoNets: Multimodal deep learning approaches for emotion recognition in video

This paper explores multiple methods for combining cues from multiple modalities into one common classifier, which achieves considerably greater accuracy than predictions from the strongest single-modality classifier.

Multimodal Deep Learning

This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address them. It also demonstrates cross-modality feature learning, where better features for one modality can be learned when multiple modalities are present at feature-learning time.

Multimodal Transfer Deep Learning for Audio Visual Recognition

We propose a multimodal deep learning framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality.

Multimodal Dynamic Networks for Gesture Recognition

It is demonstrated that multimodal feature learning extracts semantically meaningful shared representations that outperform individual modalities, and that the early fusion scheme is effective compared with the traditional late fusion method.
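The early-vs-late fusion distinction recurring in these papers is simple to state in code: early fusion combines raw features before a single classifier sees them, while late fusion combines the per-modality classifiers' outputs. A minimal sketch of the two schemes in general, not any one paper's architecture:

```python
def early_fusion(feat_a, feat_b):
    """Early fusion: concatenate modality features into one joint vector,
    which a single downstream classifier then consumes."""
    return feat_a + feat_b

def late_fusion(probs_a, probs_b):
    """Late fusion: each modality has its own classifier; average their
    class-probability outputs (a common, simple combination rule)."""
    return [(pa + pb) / 2.0 for pa, pb in zip(probs_a, probs_b)]

joint = early_fusion([0.1, 0.4], [0.7])          # -> [0.1, 0.4, 0.7]
combined = late_fusion([0.8, 0.2], [0.4, 0.6])
```

Early fusion lets the classifier model cross-modal interactions directly; late fusion is more robust when one modality's features are missing or unreliable.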

Co-Adaptation of audio-visual speech and gesture classifiers

It is demonstrated that multimodal co-training can be used to learn from only a few labeled examples in one or both of the audio-visual modalities, and a co-adaptation algorithm is proposed, which adapts existing audio-visual classifiers to a particular user or noise condition by leveraging the redundancy in the unlabeled data.

Multimodal fusion for multimedia analysis: a survey

This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks.

On the Integration of Grounding Language and Learning Objects

A multimodal learning system is presented that can ground spoken names of objects in their physical referents and learn to recognize those objects simultaneously from naturally co-occurring multisensory input, incorporating spatio-temporal and cross-modal constraints on the multimodal data.

Learning Representations for Multimodal Data with Deep Belief Nets

The experimental results on bi-modal data consisting of images and text show that the Multimodal DBN can learn a good generative model of the joint space of image and text inputs that is useful for filling in missing data, so it can be used for both image annotation and image retrieval.

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

This work introduces the structure-content neural language model, which disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders the learned embedding space captures multimodal regularities in terms of vector space arithmetic.