Multimodal Machine Learning: A Survey and Taxonomy

Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency
IEEE Transactions on Pattern Analysis and Machine Intelligence
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell odors, and taste flavors. [...] Key result: this new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.


Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

This paper proposes a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification covering historical and recent trends, and defines two key principles of modality heterogeneity and interconnections that have driven subsequent innovations.

A Short Survey on Deep Learning for Multimodal Integration: Applications, Future Perspectives and Challenges

This short survey covers multimodal integration using deep-learning methods and comprehensively reviews the concept of multimodality, describing it from a two-dimensional perspective.

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

This survey seeks to improve the computer vision community's understanding of the key concepts and algorithms of deep multimodal learning by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities.

Vision+X: A Survey on Multimodal Learning in the Light of Data

This paper analyzes the commonness and uniqueness of each data format ranging from vision, audio, text and others, and presents the technical development categorized by the combination of Vision+X, where the vision data play a fundamental role in most multimodal learning works.

Multimodal Conversational AI: A Survey of Datasets and Approaches

This paper motivates, defines, and mathematically formulates the multimodal conversational research objective, and provides a taxonomy of research required to solve the objective: multi-modality representation, fusion, alignment, translation, and co-learning.

Remote Sensing and Time Series Data Fused Multimodal Prediction Model Based on Interaction Analysis

A scalable interaction model based on the Squeeze-and-Excitation block is proposed to fuse the image modality from remote sensing images and the temporal modality from user visit sequences to solve a practical problem of urban functional area classification.

Multimodality in Meta-Learning: A Comprehensive Survey

Exploring modality-agnostic representations for music classification

This work explores the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality.
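To make the retrieval step concrete, here is a minimal sketch, assuming embeddings from two modalities already live in one shared vector space. It only ranks items by cosine similarity; it does not show how the representations are learned, and it is not the cited work's implementation. The names `cosine`, `retrieve`, `modality_a_query`, and `modality_b_gallery`, along with all the vectors, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda key: cosine(query, gallery[key]), reverse=True)

# Hypothetical embeddings: one query from modality A, three items from modality B.
modality_a_query = [0.9, 0.1, 0.0]
modality_b_gallery = {
    "item_1": [1.0, 0.0, 0.0],
    "item_2": [0.0, 1.0, 0.0],
    "item_3": [0.0, 0.0, 1.0],
}

ranking = retrieve(modality_a_query, modality_b_gallery)
print(ranking)
```

If the shared space is modality-agnostic in the sense described above, a downstream classifier could consume either embedding without knowing which modality produced it.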



Multimodal learning with deep Boltzmann machines

A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.

EmoNets: Multimodal deep learning approaches for emotion recognition in video

This paper explores multiple methods for the combination of cues from these modalities into one common classifier, which achieves a considerably greater accuracy than predictions from the strongest single-modality classifier.

Multimodal Deep Learning

This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.

ModDrop: Adaptive Multi-Modal Gesture Recognition

The proposed ModDrop training technique ensures robustness of the classifier to missing signals in one or several channels to produce meaningful predictions from any number of available modalities, and demonstrates the applicability of the proposed fusion scheme to modalities of arbitrary nature by experiments on the same dataset augmented with audio.
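One way to picture training for robustness to missing channels is to blank out whole modalities at random during training. The sketch below illustrates that general idea under stated assumptions and is not the authors' ModDrop implementation: the function name `drop_modalities`, the dictionary layout, the zero-filling, and the rule of always keeping at least one modality are all hypothetical choices.

```python
import random

def drop_modalities(sample, p_drop, rng):
    """Zero out whole modalities at random, keeping at least one intact.

    sample: dict mapping a modality name to a list of floats (hypothetical layout).
    p_drop: probability of dropping each modality independently.
    rng:    a random.Random instance, passed in so runs are reproducible.
    """
    names = sorted(sample)
    kept = [name for name in names if rng.random() >= p_drop]
    if not kept:
        # Assumption: never drop everything, so some signal always remains.
        kept = [rng.choice(names)]
    return {
        name: list(sample[name]) if name in kept else [0.0] * len(sample[name])
        for name in names
    }

rng = random.Random(0)
sample = {"audio": [0.2, 0.7], "depth": [0.1, 0.4, 0.9], "video": [0.5, 0.3]}
augmented = drop_modalities(sample, p_drop=0.5, rng=rng)
print(augmented)
```

Applied to each training example, this kind of augmentation exposes a classifier to every subset of available modalities, which is the situation it faces when a sensor is missing at test time.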

Multimodal Transfer Deep Learning for Audio Visual Recognition

We propose a multimodal deep learning framework that can transfer the knowledge obtained from a single-modal neural network to a network with a different modality. For instance, we show that we can

Multimodal Dynamic Networks for Gesture Recognition

It is demonstrated that multimodal feature learning extracts semantically meaningful shared representations that outperform individual modalities, and that the early fusion scheme is effective compared with the traditional method of late fusion.
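The difference between early and late fusion can be sketched with toy linear classifiers. This is a minimal illustration, not the model from the cited paper: the logistic scorers, the equal-weight averaging in late fusion, and every feature and weight value are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(features, weights, bias=0.0):
    """Logistic score of a feature vector under one linear classifier."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)

def early_fusion(audio, video, joint_weights):
    """Concatenate the modality features first, then apply a single classifier."""
    return linear_score(audio + video, joint_weights)

def late_fusion(audio, video, audio_weights, video_weights):
    """Score each modality with its own classifier, then average the scores."""
    return 0.5 * (linear_score(audio, audio_weights) + linear_score(video, video_weights))

# Hypothetical features and weights, for illustration only.
audio = [0.2, 0.8]
video = [0.5, 0.1, 0.4]
early = early_fusion(audio, video, joint_weights=[1.0, -0.5, 0.3, 0.0, 2.0])
late = late_fusion(audio, video, audio_weights=[1.0, -0.5], video_weights=[0.3, 0.0, 2.0])
print(early, late)
```

Early fusion lets one model see all features jointly, so it can in principle learn cross-modal interactions; late fusion only combines per-modality decisions.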

Co-Adaptation of audio-visual speech and gesture classifiers

It is demonstrated that multimodal co-training can be used to learn from only a few labeled examples in one or both of the audio-visual modalities, and a co-adaptation algorithm is proposed, which adapts existing audio-visual classifiers to a particular user or noise condition by leveraging the redundancy in the unlabeled data.

Multimodal fusion for multimedia analysis: a survey

This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia

On the Integration of Grounding Language and Learning Objects

A multimodal learning system is presented that can ground spoken names of objects in their physical referents and simultaneously learn to recognize those objects from naturally co-occurring multisensory input, incorporating the spatio-temporal and cross-modal constraints of the multimodal data.

Learning Representations for Multimodal Data with Deep Belief Nets

The experimental results on bi-modal data consisting of images and text show that the Multimodal DBN can learn a good generative model of the joint space of image and text inputs that is useful for filling in missing data, so it can be used for both image annotation and image retrieval.