Multi-Modal Deep Analysis for Multimedia

  • Wenwu Zhu, Xin Wang, Hongzhi Li
  • Computer Science
    IEEE Transactions on Circuits and Systems for Video Technology

With the rapid development of the Internet and multimedia services in the past decade, a huge amount of user-generated and service-provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, imposing great challenges on processing and analyzing them. Multi-modal data consist of a mixture of various types of data from different modalities, such as text, images, videos, and audio. In this article, we present a deep and comprehensive overview for…

On the Fusion of Multiple Audio Representations for Music Genre Classification

This work presented an exploratory study of different neural-network fusion techniques for music genre classification with multiple input features, and demonstrated that Multi-Feature Fusion Networks consistently improve classification accuracy for suitable choices of input representations.
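The simplest variant of multi-feature fusion can be sketched as late fusion: average the per-genre probability vectors produced from each input representation and pick the best genre. This is a minimal illustrative sketch, not the paper's network; the function name and the assumption that each representation already yields class probabilities are hypothetical.

```python
def late_fuse(per_feature_probs):
    """Late fusion over audio representations (illustrative, not the paper's API).

    per_feature_probs: one class-probability vector per input representation
    (e.g. mel-spectrogram, MFCC, chroma -- hypothetical choices).
    Returns the averaged probability vector and the index of the fused winner.
    """
    n = len(per_feature_probs)
    num_classes = len(per_feature_probs[0])
    # element-wise average of the probability vectors
    fused = [sum(p[i] for p in per_feature_probs) / n for i in range(num_classes)]
    # predicted genre = argmax of the fused probabilities
    return fused, max(range(num_classes), key=fused.__getitem__)
```

In practice the paper's fusion networks combine features inside the model rather than at the output, but this late-fusion baseline is the standard point of comparison.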

Expansion-Squeeze-Excitation Fusion Network for Elderly Activity Recognition

This work aggregates the discriminative information of actions and interactions from both RGB videos and skeleton sequences by attentively fusing multi-modal features with a novel Expansion-Squeeze-Excitation Fusion Network (ESE-FN).
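The squeeze-and-excitation gating that underlies this style of fusion can be sketched in plain Python. This shows only the generic SE mechanism (squeeze via global average pooling, a two-layer excitation MLP, then channel rescaling), not the authors' ESE-FN; all names, weight shapes, and values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def se_gate(channels, W1, W2):
    """Squeeze-and-Excitation style channel gating (generic sketch).

    channels: list of per-channel activation lists.
    W1, W2: weights of the two-layer excitation MLP (hypothetical shapes:
    W1 maps C squeezed values to a hidden vector, W2 maps it back to C gates).
    """
    # squeeze: global average pool each channel to a single descriptor
    squeezed = [sum(c) / len(c) for c in channels]
    # excitation layer 1: linear + ReLU
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in W1]
    # excitation layer 2: linear + sigmoid -> one gate per channel
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in W2]
    # rescale: multiply each channel by its learned gate
    return [[g * a for a in c] for g, c in zip(gates, channels)]
```

In a multi-modal setting, the gates let the network emphasize whichever modality's channels carry the more discriminative evidence for a given action.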

Multimedia Intelligence: When Multimedia Meets Artificial Intelligence

  • Wenwu Zhu, Xin Wang, W. Gao
  • Computer Science
    IEEE Transactions on Multimedia
  • 2020
The concept of Multimedia Intelligence is introduced by investigating the mutual influence between multimedia and Artificial Intelligence; the efforts made in the literature are surveyed, and insights on research directions that deserve further study are shared.

Implementation of Short Video Click-Through Rate Estimation Model Based on Cross-Media Collaborative Filtering Neural Network

By directly extracting the image, behavioral, and audio features of short videos as the video feature representation, the proposed model considers more video information than other models and improves on AUC, accuracy, and log-loss metrics.

MDFNet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification

MDFNet can serve as an effective auxiliary diagnostic tool for skin cancer, helping physicians improve clinical decision-making and the efficiency of clinical diagnosis; moreover, its proposed data fusion method fully exploits the advantage of information convergence and offers a reference for the intelligent diagnosis of numerous clinical diseases.

A Comprehensive Report on Machine Learning-based Early Detection of Alzheimer's Disease using Multi-modal Neuroimaging Data

A variety of feature selection, scaling, and fusion methodologies, along with the challenges confronted, are elaborated for designing an ML-based AD diagnosis system built on multi-modal neuroimaging data from patients with AD.

Video Grounding and Its Generalization

This tutorial gives a detailed introduction to the development and evolution of this task, points out the limitations of existing benchmarks, and extends the text-based grounding task to more general scenarios, in particular how it guides the learning of other video-language tasks such as video question answering based on event grounding.

Cross-Domain Collaborative Learning in Social Multimedia

This work proposes a generic Cross-Domain Collaborative Learning (CDCL) framework based on a non-parametric Bayesian dictionary learning model, in which different information sources complement and enhance each other for cross-domain data analysis.

Multimodal learning with deep Boltzmann machines

A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.

Combining modality specific deep neural networks for emotion recognition in video

In this paper we present the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions…

Cross-Platform Multi-Modal Topic Modeling for Personalized Inter-Platform Recommendation

Qualitative and quantitative evaluation results validate the effectiveness of the proposed cross-platform multi-modal topic model (CM3TM) and demonstrate the advantage of connecting different platforms with different modalities for inter-platform recommendation.

Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering

A Multi-modal Factorized Bilinear (MFB) pooling approach efficiently and effectively combines multi-modal features, yielding superior VQA performance compared with other bilinear pooling approaches.
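The MFB idea can be sketched in a few lines: project each modality with a factor matrix, take the element-wise product, sum-pool over windows of the factor rank, then apply power and L2 normalization. This is a minimal pure-Python sketch of the published formulation; the matrices, dimensions, and function names here are illustrative, not the authors' code.

```python
import math

def matvec(W, v):
    # W: matrix as a list of rows; returns W @ v
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in W]

def mfb_pool(x, y, U, V, k):
    """Multi-modal Factorized Bilinear pooling (sketch).

    x, y: feature vectors from two modalities (e.g. question and image).
    U, V: factor matrices projecting x and y into a shared k*o space.
    k: sum-pooling window size (the factor rank).
    Returns an o-dimensional fused, normalized vector.
    """
    joint = [a * b for a, b in zip(matvec(U, x), matvec(V, y))]  # element-wise product
    # sum-pool over consecutive windows of size k
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # signed square-root ("power") normalization
    powed = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    # L2 normalization
    norm = math.sqrt(sum(v * v for v in powed)) or 1.0
    return [v / norm for v in powed]
```

The factorization is what makes this tractable: a full bilinear interaction between two high-dimensional features would need a huge weight tensor, while U and V keep the parameter count linear in the input dimensions.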

Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization

  • Ting Yao, Tao Mei, Y. Rui
  • Computer Science
    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2016
A novel pairwise deep ranking model that employs deep learning to learn the relationship between highlight and non-highlight video segments is proposed, improving over the state-of-the-art RankSVM method by 10.5% in accuracy.
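The pairwise objective in such ranking models is commonly a margin ranking loss over (highlight, non-highlight) segment score pairs. The sketch below shows that standard formulation, not necessarily the paper's exact loss; the function names are illustrative.

```python
def pairwise_ranking_loss(highlight_score, non_highlight_score, margin=1.0):
    """Margin-based pairwise ranking loss: zero when the model scores the
    highlight segment at least `margin` above the non-highlight segment,
    otherwise growing linearly with the violation."""
    return max(0.0, margin - (highlight_score - non_highlight_score))

def batch_loss(pairs, margin=1.0):
    """Average the pairwise loss over a batch of
    (highlight_score, non_highlight_score) pairs."""
    return sum(pairwise_ranking_loss(h, n, margin) for h, n in pairs) / len(pairs)
```

Training on pairs rather than absolute labels sidesteps the need for a calibrated "highlightness" score: the model only has to order segments correctly within each video.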

Video Summarization by Learning Deep Side Semantic Embedding

A novel deep side semantic embedding (DSSE) model is presented to generate video summaries by leveraging freely available side information; the superior performance of DSSE over several state-of-the-art video summarization approaches is demonstrated.

Multimodal Deep Learning

This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address them; it demonstrates cross-modality feature learning, where better features for one modality can be learned when multiple modalities are present at feature-learning time.