A Review on Methods and Applications in Multimodal Deep Learning
@article{Summaira2022ARO,
  title   = {A Review on Methods and Applications in Multimodal Deep Learning},
  author  = {Jabeen Summaira and Xi Li and Amin Muhammad Shoib and Jabbar Abdul},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2202.09195}
}
Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning (MMDL) is to create models that can process and link information across various modalities. Despite the extensive development of unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning helps to understand and analyze information better when multiple senses are engaged in processing it. This paper…
References
Showing 1-10 of 119 references
Deep Multimodal Representation Learning: A Survey
- IEEE Access, 2019
The key issues of newly developed technologies, such as the encoder-decoder model, generative adversarial networks, and the attention mechanism, are highlighted from a multimodal representation learning perspective; to the best of the authors' knowledge, these have never been reviewed previously.
Multimodal Machine Learning: A Survey and Taxonomy
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019
This paper surveys recent advances in multimodal machine learning and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition
- Computer Vision and Image Understanding, 2018
Deep Spatio-Temporal Features for Multimodal Emotion Recognition
- IEEE Winter Conference on Applications of Computer Vision (WACV), 2017
A novel approach is introduced that uses 3-dimensional convolutional neural networks (C3Ds) to model spatio-temporal information, cascaded with multimodal deep-belief networks (DBNs) that can represent the audio and video streams.
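As an illustrative aside, the following is a minimal PyTorch sketch of a C3D-style spatio-temporal feature extractor for video clips. The layer sizes, the assumed precomputed audio embedding, and the plain concatenation that stands in for the cited paper's multimodal DBN fusion are all assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class C3DFeatures(nn.Module):
    # Tiny 3D-CNN sketch for spatio-temporal video features.
    # A minimal stand-in for a C3D-style extractor; layer sizes are
    # illustrative assumptions, and the multimodal DBN fusion stage of the
    # cited paper is replaced below by simple feature concatenation.
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, clip):  # clip: (batch, 3, frames, height, width)
        return self.fc(self.conv(clip).flatten(1))

# Fuse video features with an assumed audio embedding (simplified fusion).
video = C3DFeatures()(torch.randn(2, 3, 16, 112, 112))   # (2, 256)
audio = torch.randn(2, 128)                              # placeholder audio features
joint = torch.cat([video, audio], dim=-1)                # (2, 384)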
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods
- Journal of Artificial Intelligence Research, 2021
This survey focuses on ten prominent tasks that integrate language and vision, discussing their problem formulations, methods, existing datasets, and evaluation measures, and comparing the results obtained with corresponding state-of-the-art methods.
Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering
- IEEE International Conference on Computer Vision (ICCV), 2017
A Multi-modal Factorized Bilinear (MFB) pooling approach is proposed to efficiently and effectively combine multimodal features, resulting in superior performance on VQA compared with other bilinear pooling approaches.
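As an illustrative aside, the following is a minimal PyTorch sketch of factorized bilinear pooling in the spirit of MFB, fusing an image feature with a question feature. The dimensions, factor size, and normalization details are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBPooling(nn.Module):
    # Factorized bilinear pooling sketch: project each modality into a
    # (factor_k * out_dim)-dimensional space, take the element-wise product,
    # sum-pool over the factor dimension, then apply power and L2 normalization.
    def __init__(self, img_dim, txt_dim, factor_k=5, out_dim=1000):
        super().__init__()
        self.factor_k = factor_k
        self.out_dim = out_dim
        self.proj_img = nn.Linear(img_dim, factor_k * out_dim)
        self.proj_txt = nn.Linear(txt_dim, factor_k * out_dim)

    def forward(self, img_feat, txt_feat):
        # Element-wise product of the two projected modalities.
        joint = self.proj_img(img_feat) * self.proj_txt(txt_feat)
        # Sum-pool over the factor dimension.
        joint = joint.view(-1, self.out_dim, self.factor_k).sum(dim=2)
        # Signed square-root (power) normalization followed by L2 normalization.
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint, dim=1)

# Example: fuse a 2048-d image feature with a 1024-d question feature.
fused = MFBPooling(2048, 1024)(torch.randn(4, 2048), torch.randn(4, 1024))
print(fused.shape)  # torch.Size([4, 1000])

The factorized projection keeps the joint representation compact compared with a full bilinear (outer-product) interaction, which is the main efficiency argument behind this family of pooling methods.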
Image caption generation with dual attention mechanism
- Information Processing & Management, 2020
A Review of Deep Learning with Special Emphasis on Architectures, Applications and Recent Trends
- ArXiv, 2019
This review presents a refresher on the many different stacked, connectionist networks that make up deep learning architectures, followed by automatic architecture optimization protocols using multi-agent approaches, and provides a handy reference for researchers seeking to adopt deep learning in their work.
GLA: Global–Local Attention for Image Description
- IEEE Transactions on Multimedia, 2018
The proposed GLA method generates more relevant image description sentences and achieves state-of-the-art performance on the well-known Microsoft COCO caption dataset under several popular evaluation metrics.
Multi-Attention Generative Adversarial Network for image captioning
- Neurocomputing, 2020