Multi-Modal Deep Analysis for Multimedia

Wenwu Zhu, Xin Wang, Hongzhi Li · IEEE Transactions on Circuits and Systems for Video Technology

With the rapid development of the Internet and multimedia services in the past decade, a huge amount of user-generated and service-provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, posing great challenges for processing and analysis. Multi-modal data consist of a mixture of various types of data from different modalities, such as text, images, videos, and audio. In this article, we present a deep and comprehensive overview for…

On the Fusion of Multiple Audio Representations for Music Genre Classification

This work presents an exploratory study of different neural network fusion techniques for music genre classification with multiple features as input, and demonstrates that Multi-Feature Fusion Networks consistently improve classification accuracy for suitable choices of input representations.
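As a rough sketch of the idea (not the paper's implementation), multiple feature-specific model outputs can be late-fused by a weighted average of their class probabilities; the branch names and weights below are illustrative stand-ins:

```python
import numpy as np

def fuse_predictions(per_feature_probs, weights=None):
    """Late-fuse class probabilities from several feature-specific models.

    per_feature_probs: list of (n_classes,) probability arrays, one per
    input representation (e.g. a mel-spectrogram branch and an MFCC branch).
    weights: optional per-branch weights; defaults to a uniform average.
    """
    probs = np.stack(per_feature_probs)          # (n_branches, n_classes)
    if weights is None:
        weights = np.ones(len(per_feature_probs))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize to a convex combination
    return weights @ probs                       # weighted average over branches

# Toy example: two branches disagree on the runner-up; fusion keeps the consensus genre.
mel_branch  = np.array([0.6, 0.3, 0.1])
mfcc_branch = np.array([0.5, 0.1, 0.4])
fused = fuse_predictions([mel_branch, mfcc_branch])
print(fused)                 # [0.55 0.2  0.25]
print(int(fused.argmax()))   # 0
```

The actual Multi-Feature Fusion Networks learn the combination jointly rather than averaging fixed predictions; this only illustrates the late-fusion baseline such models improve upon.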

Expansion-Squeeze-Excitation Fusion Network for Elderly Activity Recognition

This work aggregates the discriminative information of actions and interactions from both RGB videos and skeleton sequences by attentively fusing multi-modal features with a novel Expansion-Squeeze-Excitation Fusion Network (ESE-FN).
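A minimal numpy sketch of the squeeze-and-excitation-style gating underlying such fusion (assuming already-pooled per-modality feature vectors; the random weights below are stand-ins for learned parameters, and this simplifies the paper's actual ESE-FN architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_fuse(rgb_feat, skel_feat, w1, w2):
    """SE-style channel gating over concatenated modal features.

    rgb_feat, skel_feat: (C,) pooled feature vectors from each modality.
    w1: (2C, 2C//r) bottleneck ("squeeze") weights;
    w2: (2C//r, 2C) expansion ("excitation") weights, r = reduction ratio.
    """
    x = np.concatenate([rgb_feat, skel_feat])   # joint feature, (2C,)
    s = np.maximum(x @ w1, 0.0)                 # bottleneck projection + ReLU
    gate = sigmoid(s @ w2)                      # per-channel attention gate in (0, 1)
    return x * gate                             # re-weighted fused feature

C, r = 8, 4
w1 = rng.standard_normal((2 * C, (2 * C) // r))
w2 = rng.standard_normal(((2 * C) // r, 2 * C))
fused = se_fuse(rng.standard_normal(C), rng.standard_normal(C), w1, w2)
print(fused.shape)  # (16,)
```

The gate lets the network attenuate uninformative channels from one modality while amplifying discriminative ones from the other.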

Exploring the Benefits of Cross-Modal Coding

Numerical results demonstrate that the proposed cross-modal coding can achieve significant benefits relative to the existing schemes, especially when multi-modal signals have strong semantic correlation.

A Variational Inference Method for Few-Shot Learning

A novel two-generation Latent Feature Augmentation and Distribution Regularization (LFADR) framework is proposed, comprising a prior relation net (PRN) and a VAE-based posterior relation net (VPORN); the VPORN is made more robust by transferring prior knowledge from the PRN in few-shot learning (FSL).

MDFNet: application of multimodal fusion method based on skin image and clinical data to skin cancer classification

MDFNet can serve as an effective auxiliary tool for skin cancer diagnosis, helping physicians improve clinical decision-making and the efficiency of diagnosis; moreover, its proposed data fusion method fully exploits the advantage of information convergence and offers a useful reference for the intelligent diagnosis of many clinical diseases.

Video Grounding and Its Generalization

This tutorial will give a detailed introduction about the development and evolution of this task, point out the limitations of existing benchmarks, and extend such a text-based grounding task to more general scenarios, especially how it guides the learning of other video-language tasks like video question answering based on event grounding.

A review on video summarization techniques

Implementation of Short Video Click-Through Rate Estimation Model Based on Cross-Media Collaborative Filtering Neural Network

By directly extracting the image, behavioral, and audio features of short videos as the video feature representation, the proposed model considers more video information than other models and improves on AUC, accuracy, and log-loss metrics.

A Comprehensive Report on Machine Learning-based Early Detection of Alzheimer's Disease using Multi-modal Neuroimaging Data

A variety of feature selection, scaling, and fusion methodologies along with confronted challenges are elaborated for designing an ML-based AD diagnosis system based on multi-modal neuroimaging data from patients with AD.

Cross-Domain Collaborative Learning in Social Multimedia

This work proposes a generic Cross-Domain Collaborative Learning (CDCL) framework based on a non-parametric Bayesian dictionary learning model, which can effectively explore the virtues of different information sources so that they complement and enhance each other in cross-domain data analysis.

Combining modality specific deep neural networks for emotion recognition in video

In this paper we present the techniques used for the University of Montréal's team submissions to the 2013 Emotion Recognition in the Wild Challenge. The challenge is to classify the emotions

Cross-Platform Multi-Modal Topic Modeling for Personalized Inter-Platform Recommendation

Qualitative and quantitative evaluation results validate the effectiveness of the proposed cross-platform multi-modal topic model (CM3TM) and demonstrate the advantage of connecting different platforms with different modalities for inter-platform recommendation.

Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering

A Multi-modal Factorized Bilinear (MFB) pooling approach is proposed to efficiently and effectively combine multi-modal features, resulting in superior performance for VQA compared with other bilinear pooling approaches.
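The core MFB operation can be sketched in a few lines of numpy: project both modalities, take an elementwise product, sum-pool over the k factors, then apply power and L2 normalization. The random projections and dimensions below are illustrative stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def mfb_pool(x, y, U, V, k):
    """Multi-modal Factorized Bilinear pooling (sketch).

    x: (m,) e.g. image feature; y: (n,) e.g. question feature.
    U: (m, k*o) and V: (n, k*o) are low-rank factor projections;
    k is the factor count, o the fused output dimension.
    """
    joint = (x @ U) * (y @ V)                 # elementwise product, (k*o,)
    z = joint.reshape(-1, k).sum(axis=1)      # sum-pool over k factors, (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))       # power normalization
    return z / (np.linalg.norm(z) + 1e-12)    # L2 normalization

m, n, k, o = 6, 5, 3, 4
U = rng.standard_normal((m, k * o))
V = rng.standard_normal((n, k * o))
z = mfb_pool(rng.standard_normal(m), rng.standard_normal(n), U, V, k)
print(z.shape)  # (4,)
```

The factorization keeps the expressiveness of a bilinear interaction while avoiding the m×n×o parameter count of a full bilinear map.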

Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization

  • Ting Yao, Tao Mei, Y. Rui
  • 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
A novel pairwise deep ranking model that employs deep learning techniques to learn the relationship between highlight and non-highlight video segments is proposed, achieving a 10.5% accuracy improvement over the state-of-the-art RankSVM method.
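The pairwise objective behind such a ranking model can be sketched as a hinge loss over (highlight, non-highlight) segment pairs; this is a generic margin ranking loss, not the paper's exact formulation:

```python
def pairwise_ranking_loss(score_highlight, score_non_highlight, margin=1.0):
    """Hinge-style pairwise ranking loss (sketch).

    Encourages the model to score a highlight segment at least `margin`
    higher than a non-highlight segment from the same video.
    """
    return max(0.0, margin - (score_highlight - score_non_highlight))

# A well-ordered pair incurs no loss; a mis-ordered pair is penalized.
print(pairwise_ranking_loss(2.0, 0.5))  # 0.0  (gap of 1.5 exceeds the margin)
print(pairwise_ranking_loss(0.5, 2.0))  # 2.5  (ordering violated by 1.5 plus the margin)
```

Training on pairs rather than absolute labels sidesteps the need for calibrated per-segment highlight scores, which is why ranking losses suit this task.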

Video Summarization by Learning Deep Side Semantic Embedding

A novel deep side semantic embedding (DSSE) model is presented to generate video summaries by leveraging freely available side information, and its superior performance over several state-of-the-art video summarization approaches is demonstrated.

Multimodal Deep Learning

This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.

Multimodal fusion using dynamic hybrid models

A novel hybrid model is proposed that exploits the strength of discriminative classifiers along with the representational power of generative models to solve the challenge of detecting multimodal events in time varying sequences.