Corpus ID: 246996526

A Review on Methods and Applications in Multimodal Deep Learning

Authors: Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Jabbar Abdul
Deep learning has been used in a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information across various modalities. Despite extensive progress in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning supports better understanding and analysis when various senses are engaged in processing information. This paper…
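As a minimal illustration of linking information across modalities, the sketch below contrasts two common fusion strategies: early fusion, which concatenates per-modality feature vectors before a single shared model, and late fusion, which combines the outputs of separate per-modality models. The feature values, weights, and function names are hypothetical and chosen only for illustration; this is not presented as the review's own method.

```python
from typing import List, Sequence


def early_fusion(features: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-modality feature vectors into one joint vector."""
    fused: List[float] = []
    for f in features:
        fused.extend(f)
    return fused


def linear_score(x: Sequence[float], w: Sequence[float], b: float = 0.0) -> float:
    """Hypothetical stand-in 'model': a single linear unit."""
    if len(x) != len(w):
        raise ValueError("feature and weight dimensions differ")
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def late_fusion(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-modality scores (decision-level fusion)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical audio and video feature vectors for one sample.
audio = [0.5, 1.0]
video = [2.0, 0.0, 1.0]

# Early fusion: one model sees the joint 5-dimensional vector.
joint = early_fusion([audio, video])
early_score = linear_score(joint, [1.0, 1.0, 0.5, 0.5, 0.5])

# Late fusion: separate per-modality models, then combine their outputs.
audio_score = linear_score(audio, [1.0, 1.0])
video_score = linear_score(video, [0.5, 0.5, 0.5])
late_score = late_fusion([audio_score, video_score], [0.5, 0.5])
```

Early fusion lets a model learn cross-modal interactions directly from the joint vector, while late fusion keeps the per-modality models independent and only merges their decisions.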


Deep Multimodal Representation Learning: A Survey
This survey highlights the key issues of newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, from a multimodal representation learning perspective; to the best of the authors' knowledge, these had never been reviewed previously.
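The attention mechanism mentioned above can be sketched, in its generic form, as scaled dot-product attention in which a query from one modality (for example, a word embedding) attends over key/value vectors from another (for example, image-region features). This is a minimal sketch with made-up vectors and hypothetical names, not the surveyed paper's formulation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention of one query over key/value pairs.

    Returns (attended vector, attention weights).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    # Attended vector: attention-weighted average of the value vectors.
    attended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return attended, weights


# Hypothetical example: a text query attending over three image-region features.
text_query = [1.0, 0.0]
region_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
region_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

context, attn = cross_modal_attention(text_query, region_keys, region_values)
```

The region whose key is most aligned with the query receives the largest weight, so the returned context vector is dominated by that region's value.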
Multimodal Machine Learning: A Survey and Taxonomy
This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition
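Compact bilinear pooling, named in the title above, approximates the full outer product of two feature vectors with a low-dimensional sketch. One common construction is Tensor Sketch: count-sketch each vector with its own random hash and sign functions, then circularly convolve the two sketches. The pure-Python sketch below assumes that construction and uses a direct O(d²) circular convolution instead of an FFT for readability; the sketch dimension, seeds, and function names are illustrative assumptions, not the cited paper's implementation.

```python
import random
from typing import List, Sequence, Tuple


def make_hash(n: int, d: int, seed: int) -> Tuple[List[int], List[int]]:
    """Random bucket h[i] in [0, d) and sign s[i] in {-1, +1} per input dim."""
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s


def count_sketch(x: Sequence[float], h: Sequence[int], s: Sequence[int], d: int) -> List[float]:
    out = [0.0] * d
    for i, xi in enumerate(x):
        out[h[i]] += s[i] * xi
    return out


def circular_convolve(a: Sequence[float], b: Sequence[float]) -> List[float]:
    d = len(a)
    out = [0.0] * d
    for i in range(d):
        for j in range(d):
            out[(i + j) % d] += a[i] * b[j]
    return out


def compact_bilinear(
    x: Sequence[float], y: Sequence[float],
    hx: Sequence[int], sx: Sequence[int],
    hy: Sequence[int], sy: Sequence[int], d: int,
) -> List[float]:
    """Tensor Sketch: conv(CS(x), CS(y)) equals the count sketch of the outer product x y^T."""
    return circular_convolve(count_sketch(x, hx, sx, d), count_sketch(y, hy, sy, d))


d = 8
x = [1.0, 2.0, 3.0]  # hypothetical visual features
y = [0.5, -1.0]      # hypothetical audio features
hx, sx = make_hash(len(x), d, seed=0)
hy, sy = make_hash(len(y), d, seed=1)
phi = compact_bilinear(x, y, hx, sx, hy, sy, d)
```

Practical implementations compute the circular convolution with FFTs, which brings the cost down to O(d log d) per pair of vectors.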
Deep Spatio-Temporal Features for Multimodal Emotion Recognition
This paper introduces a novel approach that uses 3-dimensional convolutional neural networks (C3Ds) to model spatio-temporal information, cascaded with multimodal deep belief networks (DBNs) that can represent the audio and video streams.
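The 3D convolution at the core of C3D-style models slides a kernel along time as well as height and width, so each output value summarizes a short spatio-temporal neighborhood. A minimal single-channel "valid" 3D convolution (computed as cross-correlation, as deep learning libraries do) is sketched below in pure Python. It covers only this one building block, not the cascaded DBN stage or the paper's architecture, and the clip and kernel values are made up.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][height][width]


def conv3d_valid(clip: Volume, kernel: Volume) -> Volume:
    """Single-channel 3D cross-correlation with stride 1 and no padding."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                # Sum over the kt x kh x kw spatio-temporal neighborhood.
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += clip[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical 3-frame, 3x3 grayscale clip whose brightness increases over time.
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(3)]

# Temporal-difference kernel (2 frames x 1 x 1): responds to change between frames.
temporal_diff = [[[-1.0]], [[1.0]]]

motion = conv3d_valid(clip, temporal_diff)
```

Because the kernel spans two frames, the output has one fewer time step than the input, and every value equals the frame-to-frame brightness change (1.0 here), which a purely 2D spatial kernel could not capture.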
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
This survey focuses on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with the corresponding state-of-the-art methods.
Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering
This paper proposes a Multi-modal Factorized Bilinear (MFB) pooling approach that efficiently and effectively combines multimodal features, yielding superior VQA performance compared with other bilinear pooling approaches.
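The core MFB computation can be sketched as follows: project each modality's feature vector to a k·o-dimensional space with its own matrix, multiply the two projections element-wise, sum-pool over non-overlapping windows of size k to obtain an o-dimensional output, then apply power (signed square-root) and L2 normalization, as MFB is commonly described. This pure-Python sketch omits training, dropout, and the co-attention component; the random weights, dimensions, and function names are illustrative assumptions.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]  # list of rows


def matvec(m: Matrix, v: Sequence[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def mfb_pool(x: Sequence[float], y: Sequence[float], U: Matrix, V: Matrix, k: int) -> List[float]:
    """Multi-modal factorized bilinear pooling of feature vectors x and y.

    U has k*o rows of length len(x); V has k*o rows of length len(y).
    """
    px, py = matvec(U, x), matvec(V, y)
    if len(px) != len(py) or len(px) % k != 0:
        raise ValueError("projections must both have length k*o")
    joint = [a * b for a, b in zip(px, py)]  # element-wise product
    # Sum-pool each non-overlapping window of k values -> o outputs.
    z = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Power normalization: signed square root.
    z = [math.copysign(math.sqrt(abs(v)), v) for v in z]
    # L2 normalization (guard against an all-zero vector).
    norm = math.sqrt(sum(v * v for v in z)) or 1.0
    return [v / norm for v in z]


rng = random.Random(42)
m, n, k, o = 6, 4, 5, 3  # image dim, question dim, factor k, output dim o
image_feat = [rng.random() for _ in range(m)]
question_feat = [rng.random() for _ in range(n)]
U = random_matrix(k * o, m, rng)
V = random_matrix(k * o, n, rng)
fused = mfb_pool(image_feat, question_feat, U, V, k)
```

With k = 1 the sum-pooling step is a no-op and each output is a single rank-1 bilinear term; larger k lets each output combine k such terms, which is where the factorized (low-rank) bilinear form comes from.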
Image caption generation with dual attention mechanism
A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends
This review presents a refresher on the many different stacked, connectionist networks that make up deep learning architectures, followed by automatic architecture optimization protocols that use multi-agent approaches, and aims to provide a handy reference for researchers seeking to embrace deep learning in their work for what it is.
GLA: Global–Local Attention for Image Description
The proposed GLA method can generate more relevant image description sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset across several popular evaluation metrics.
Multi-Attention Generative Adversarial Network for image captioning