Deep Cross-Modal Audio-Visual Generation

Lele Chen, Sudhanshu Srivastava, Zhiyao Duan, and Chenliang Xu. In Proceedings of the Thematic Workshops of ACM Multimedia 2017.
Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and various studies have discovered strong correlations in human perception of auditory and visual stimuli. The authors explore different encoding methods for audio and visual signals and work on two scenarios: instrument-oriented generation and pose-oriented generation.

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

This paper proposes a Cross-Modal Cycle Generative Adversarial Network (CMCGAN) to handle cross-modal visual-audio mutual generation and develops a dynamic multimodal classification network to handle the modality missing problem.
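The "cycle" in CMCGAN refers to cycle-consistency: a sample translated into the other modality and back should recover the original. As a minimal sketch of that penalty, with toy linear maps standing in for the real generator networks (all shapes and weights here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
W_v2a = rng.normal(size=(16, 8)) * 0.1   # toy visual-to-audio "generator"
W_a2v = rng.normal(size=(8, 16)) * 0.1   # toy audio-to-visual "generator"

def cycle_loss(v, a):
    """L1 cycle-consistency: v -> a' -> v'' should recover v, and
    a -> v' -> a'' should recover a."""
    v_cycled = (v @ W_v2a) @ W_a2v
    a_cycled = (a @ W_a2v) @ W_v2a
    return np.mean(np.abs(v - v_cycled)) + np.mean(np.abs(a - a_cycled))

v = rng.normal(size=(4, 16))   # batch of toy visual features
a = rng.normal(size=(4, 8))    # batch of toy audio features
loss = cycle_loss(v, a)
```

Minimizing this term alongside the adversarial losses ties the two translation directions together into one framework.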

Audio-to-Image Cross-Modal Generation

This work confirms the possibility of training variational autoencoders (VAEs) to reconstruct image archetypes from audio data and finds a trade-off between the consistency and diversity of the generated images; this trade-off can be governed by scaling the reconstruction loss up or down, respectively.
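The trade-off described above can be made concrete by weighting the reconstruction term of the VAE objective. A hedged NumPy sketch (the `recon_weight` knob and the toy data are assumptions for illustration, not the paper's exact formulation):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, recon_weight=1.0):
    """Weighted VAE objective: recon_weight * reconstruction + KL.

    Scaling recon_weight up pushes the decoder toward consistent
    archetype reconstructions; scaling it down lets the KL term
    dominate, encouraging diversity in generated samples.
    """
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))  # squared error
    kl = -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))
    return recon_weight * recon + kl

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                  # toy data batch
x_recon = x + 0.1 * rng.normal(size=(8, 16))  # imperfect reconstruction
mu = rng.normal(scale=0.1, size=(8, 4))       # toy posterior means
log_var = rng.normal(scale=0.1, size=(8, 4))  # toy posterior log-variances

low = vae_loss(x, x_recon, mu, log_var, recon_weight=0.1)
high = vae_loss(x, x_recon, mu, log_var, recon_weight=10.0)
```

With the same reconstruction error, the weighted objective differs only in how strongly it penalizes deviations from the input, which is the knob the summary refers to.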

An Attention Enhanced Cross-Modal Image-Sound Mutual Generation Model for Birds

This work proposes an attention-enhanced cross-modal cycle adversarial generation network that obtains promising performance and achieves significant improvement under both the inception score and Fréchet inception distance criteria.

Vision-Infused Deep Audio Inpainting

This work considers a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that are coherent with their accompanying videos, and shows the effectiveness of the proposed Vision-Infused Audio Inpainter (VIAI).

FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos

This research introduces a novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input for visual-to-sound generation, exploiting the synchronicity between audio-visual modalities.

Spectrogram Analysis Via Self-Attention for Realizing Cross-Model Visual-Audio Generation

The post-experimental comparison shows that the Self-Attention module greatly improves the generation and classification of audio data and achieves results that are superior to existing cross-modal visual-audio generative models.
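As a rough illustration of applying self-attention to spectrogram data, here is a minimal NumPy sketch of scaled dot-product self-attention over a (frames × mel-bins) array; the random projection weights and shapes are illustrative assumptions, not the paper's model:

```python
import numpy as np

def self_attention(spec):
    """Scaled dot-product self-attention over spectrogram frames.

    spec: (frames, bins). Each output frame is a weighted mixture of
    all frames, letting the model relate distant time steps directly.
    """
    t, d = spec.shape
    rng = np.random.default_rng(2)
    Wq, Wk, Wv = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(3))
    q, k, v = spec @ Wq, spec @ Wk, spec @ Wv
    scores = q @ k.T / np.sqrt(d)                              # (t, t)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)              # row-wise softmax
    return weights @ v

spec = np.abs(np.random.default_rng(3).normal(size=(20, 40)))  # toy spectrogram
out = self_attention(spec)
```

The (frames × frames) attention map is what lets such a module capture long-range temporal structure that local convolutions miss.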

Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation

An audio spatialization framework is proposed to convert a monaural video into a binaural one by exploiting the relationship between audio and visual components; it can be viewed as a self-supervised learning technique and alleviates the dependency on large amounts of video data with ground-truth binaural audio during training.
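A common formulation of mono-to-binaural conversion is to predict a difference signal and combine it with the mono mix to recover the two channels. A sketch under that assumption (the sinusoidal toy signals are illustrative; in a learned framework the difference would be predicted from audio-visual features):

```python
import numpy as np

def mono_to_binaural(mono, diff):
    """Recover left/right channels from a mono mix and a difference
    signal, using mono = (L + R) / 2 and diff = (L - R) / 2."""
    left = mono + diff
    right = mono - diff
    return left, right

# Toy ground-truth stereo pair.
t = np.linspace(0.0, 1.0, 100)
left_true = np.sin(2 * np.pi * 5 * t)
right_true = 0.5 * np.sin(2 * np.pi * 5 * t + 0.3)

mono = (left_true + right_true) / 2   # what a monaural recording provides
diff = (left_true - right_true) / 2   # what the model would have to predict
left, right = mono_to_binaural(mono, diff)
```

The design choice here is that only the difference signal needs supervision (or self-supervision), since the mono mix is always available.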

Sound-to-Imagination: An Exploratory Study on Unsupervised Crossmodal Translation Using Diverse Audiovisual Data

Despite the complexity of the specified S2I translation task, the model was able to generalize well enough to obtain, on average, more than 14% interpretable and semantically coherent images translated from unknown sounds.

Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching

A novel Adversarial-Metric Learning (AML) model for audio-visual matching is proposed that generates a modality-independent representation for each person in each modality via adversarial learning, while simultaneously learning a robust similarity measure for cross-modal matching via metric learning.

SEGAN: Speech Enhancement Generative Adversarial Network

This work proposes the use of generative adversarial networks for speech enhancement; it operates at the waveform level, trains the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.

Multimodal Deep Learning

This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.

Creating a Multitrack Classical Music Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications

We introduce a dataset for facilitating audio-visual analysis of music performances. The dataset comprises 44 simple multi-instrument classical music pieces assembled from coordinated but separately recorded performances of individual tracks.

Creating A Musical Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications

A dataset for facilitating audio-visual analysis of musical performances comprising a number of simple multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks is introduced.

Improved Techniques for Training GANs

This work focuses on two applications of GANs: semi-supervised learning and the generation of images that humans find visually realistic; it presents ImageNet samples with unprecedented resolution and shows that the methods enable the model to learn recognizable features of ImageNet classes.

Adversarial Autoencoders

This paper shows how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction, and data visualization, with experiments on the MNIST, Street View House Numbers, and Toronto Face datasets.
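The defining idea of the adversarial autoencoder is to replace the VAE's KL term with a discriminator that pushes the aggregated posterior over latent codes toward a chosen prior. A minimal NumPy sketch of the two adversarial losses, using a toy logistic discriminator (the weights and shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aae_adversarial_losses(z_encoded, z_prior, w):
    """Adversarial regularization of the latent space.

    A logistic discriminator with weights w scores latent codes; it is
    trained to output 1 for prior samples and 0 for encoder outputs,
    while the encoder is trained to fool it. This stands in for the
    KL term of a VAE, shaping the aggregated posterior to match the
    prior.
    """
    eps = 1e-9
    d_prior = sigmoid(z_prior @ w)    # discriminator on real prior samples
    d_enc = sigmoid(z_encoded @ w)    # discriminator on encoder outputs
    d_loss = -np.mean(np.log(d_prior + eps)) - np.mean(np.log(1 - d_enc + eps))
    g_loss = -np.mean(np.log(d_enc + eps))  # encoder's adversarial loss
    return d_loss, g_loss

rng = np.random.default_rng(4)
z_encoded = rng.normal(size=(32, 4))  # toy encoder outputs
z_prior = rng.normal(size=(32, 4))    # samples from the chosen prior
w = rng.normal(size=4)                # toy discriminator weights
d_loss, g_loss = aae_adversarial_losses(z_encoded, z_prior, w)
```

In the full model these two losses are minimized alternately along with a reconstruction loss, which is what enables the applications listed above.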

Generative Adversarial Text to Image Synthesis

A novel deep architecture and GAN formulation is developed to effectively bridge advances in text and image modeling, translating visual concepts from characters to pixels.

Auditory-visual cross-modal perception phenomena

It is strongly suggested that the quality of realism in virtual environments must be a function of auditory and visual display fidelities considered jointly.

Multimodal learning with deep Boltzmann machines

A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.