Corpus ID: 246996526

A Review on Methods and Applications in Multimodal Deep Learning

@article{Summaira2022ARO,
  title={A Review on Methods and Applications in Multimodal Deep Learning},
  author={Jabeen Summaira and Xi Li and Amin Muhammad Shoib and Jabbar Abdul},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.09195}
}
Deep learning has been applied across a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information from multiple modalities. Despite the extensive progress made in unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning supports better understanding and analysis when various senses are engaged in processing information. This paper…
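As a loose illustration of what "processing and linking information from multiple modalities" can look like in practice, the sketch below fuses pre-extracted image and text features in a tiny PyTorch classifier. The encoder choices, feature dimensions, and fusion-by-concatenation design are illustrative assumptions, not a method described in this paper.

```python
# Minimal sketch (not from the paper): a two-modality late-fusion classifier.
# Feature sizes and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=10):
        super().__init__()
        # Modality-specific projections: each modality is processed separately.
        self.image_proj = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion head: links the two modalities by concatenating their features.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, image_feat, text_feat):
        fused = torch.cat([self.image_proj(image_feat), self.text_proj(text_feat)], dim=-1)
        return self.classifier(fused)


# Usage: pre-extracted image and text features for a batch of 4 samples.
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```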

References

SHOWING 1-10 OF 119 REFERENCES
Deep Multimodal Representation Learning: A Survey
TLDR
Highlights the key issues of newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, from a multimodal representation learning perspective, which, to the best of the authors' knowledge, have never been reviewed previously.
Multimodal Machine Learning: A Survey and Taxonomy
TLDR
This paper surveys the recent advances in multimodal machine learning and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition
Deep Spatio-Temporal Features for Multimodal Emotion Recognition
TLDR
Introduces a novel approach that uses 3-dimensional convolutional neural networks (C3Ds) to model spatio-temporal information, cascaded with multimodal deep belief networks (DBNs) that represent the audio and video streams.
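A minimal sketch of this audio-visual idea, assuming PyTorch: a small 3D CNN models the spatio-temporal structure of a video clip and is fused with an audio branch. The DBN of the cited paper is replaced here by a plain MLP for brevity, and all layer sizes are illustrative assumptions rather than the authors' configuration.

```python
# Hedged sketch: 3D convolutions over video clips fused with an audio branch.
# The DBN of the cited paper is replaced by a plain MLP; sizes are assumptions.
import torch
import torch.nn as nn


class AudioVisualEmotionNet(nn.Module):
    def __init__(self, audio_dim=128, num_emotions=7):
        super().__init__()
        # 3D convolutions capture spatial AND temporal structure of the clip.
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),          # -> (batch, 32)
        )
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 32), nn.ReLU())
        self.head = nn.Linear(32 + 32, num_emotions)

    def forward(self, clip, audio_feat):
        # clip: (batch, channels, frames, height, width)
        fused = torch.cat([self.video_branch(clip), self.audio_branch(audio_feat)], dim=-1)
        return self.head(fused)


model = AudioVisualEmotionNet()
logits = model(torch.randn(2, 3, 16, 64, 64), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 7])
```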
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
TLDR
This survey focuses on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with corresponding state-of-the-art methods.
Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering
TLDR
Proposes a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance on VQA compared with other bilinear pooling approaches.
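A hedged sketch of MFB pooling, assuming PyTorch: both modalities are projected into a k·o-dimensional space, multiplied element-wise, sum-pooled over groups of k factors, then power- and L2-normalized. The feature dimensions and factor size below are illustrative assumptions.

```python
# Hedged sketch of Multi-modal Factorized Bilinear (MFB) pooling.
# Input/output dimensions and the factor size k are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFBPooling(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=1024, out_dim=1000, factor_k=5):
        super().__init__()
        self.out_dim, self.factor_k = out_dim, factor_k
        self.img_proj = nn.Linear(img_dim, out_dim * factor_k)
        self.txt_proj = nn.Linear(txt_dim, out_dim * factor_k)

    def forward(self, img_feat, txt_feat):
        # Element-wise product in the expanded (out_dim * k) space.
        joint = self.img_proj(img_feat) * self.txt_proj(txt_feat)
        # Sum-pool over each group of k factors -> low-rank bilinear interaction.
        joint = joint.view(-1, self.out_dim, self.factor_k).sum(dim=2)
        # Signed square root ("power normalization") followed by L2 normalization.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint, dim=-1)


mfb = MFBPooling()
fused = mfb(torch.randn(8, 2048), torch.randn(8, 1024))
print(fused.shape)  # torch.Size([8, 1000])
```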
Image caption generation with dual attention mechanism
A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends
TLDR
This review presents a refresher on the many different stacked, connectionist networks that make up deep learning architectures, followed by automatic architecture optimization protocols using multi-agent approaches, and aims to provide a handy reference for researchers seeking to embrace deep learning in their work.
GLA: Global–Local Attention for Image Description
TLDR
The proposed GLA method generates more relevant image description sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset under several popular evaluation metrics.
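A minimal sketch of the global-local attention idea, assuming PyTorch: a decoder state attends over local region features, and the attended context is combined with a global image feature. The additive attention form and all tensor shapes are assumptions for illustration, not the exact GLA formulation.

```python
# Hedged sketch: attention over local region features, fused with a global
# image feature. Shapes and the additive attention form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalAttention(nn.Module):
    def __init__(self, region_dim=2048, global_dim=2048, hidden_dim=512):
        super().__init__()
        self.region_key = nn.Linear(region_dim, hidden_dim)
        self.state_query = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.fuse = nn.Linear(region_dim + global_dim, hidden_dim)

    def forward(self, regions, global_feat, dec_state):
        # regions: (batch, num_regions, region_dim), dec_state: (batch, hidden_dim)
        energy = self.score(torch.tanh(
            self.region_key(regions) + self.state_query(dec_state).unsqueeze(1)))
        weights = F.softmax(energy, dim=1)                  # attention over regions
        local_ctx = (weights * regions).sum(dim=1)          # attended local feature
        return torch.tanh(self.fuse(torch.cat([local_ctx, global_feat], dim=-1)))


att = GlobalLocalAttention()
ctx = att(torch.randn(2, 36, 2048), torch.randn(2, 2048), torch.randn(2, 512))
print(ctx.shape)  # torch.Size([2, 512])
```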
Multi-Attention Generative Adversarial Network for image captioning
...