High-Fidelity Audio Generation and Representation Learning With Guided Adversarial Autoencoder

Kazi Nazmul Haque, Rajib Kumar Rana, Björn W. Schuller. IEEE Access.
Generating high-fidelity conditional audio samples and learning representations from unlabelled audio data are two challenging problems in machine learning research. Recent advances in Generative Adversarial Network (GAN) architectures show great promise in addressing these challenges. Learning powerful representations with a GAN architecture requires superior sample-generation quality, which in turn demands an enormous amount of labelled data. In this paper, we address this issue by… 
A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions
This paper provides an overview of composition tasks at different levels of music generation, covering most of the currently popular deep-learning-based music generation tasks.
FoleyGAN: Visually Guided Generative Adversarial Network-Based Synchronous Sound Generation in Silent Videos
This research introduces the novel task of guiding a class-conditioned generative adversarial network with the temporal visual information of a video input for visual-to-sound generation, exploiting the synchrony between audiovisual modalities.
PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification
An adversarial autoencoder (AAE) is proposed to replace the generative adversarial network (GAN) in Private Aggregation of Teacher Ensembles (PATE), guaranteeing ε-differential privacy (DP) on the derived classifiers for speech applications.
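PATE's core aggregation step can be illustrated with a short sketch. This is an assumption-laden toy, not the paper's PATE-AAE implementation: the function name, the Laplace noise scale, and the ε value are all illustrative choices. Each teacher votes for a class, Laplace noise is added to the per-class vote counts, and the noisy argmax is released as the privately labelled answer.

```python
import numpy as np

def pate_noisy_vote(teacher_votes, num_classes, epsilon, rng=None):
    """Illustrative PATE-style noisy-max aggregation.

    Counts teacher votes per class, adds Laplace noise (scale 2/epsilon is
    one common calibration; the exact scale depends on the privacy analysis),
    and returns the argmax as the released label.
    """
    rng = np.random.default_rng(rng)
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=2.0 / epsilon, size=num_classes)
    return int(np.argmax(counts))

# Eight hypothetical teachers' predicted labels for one query
votes = [0, 0, 1, 0, 2, 0, 0, 1]
label = pate_noisy_vote(votes, num_classes=3, epsilon=1.0, rng=0)
```

With a strong majority among the teachers, the noise rarely flips the answer; as ε shrinks, the noise grows and weaker majorities are protected at the cost of accuracy.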
Large Scale Adversarial Representation Learning
This work builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator, and demonstrates that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation.
High-Fidelity Image Generation With Fewer Labels
This work demonstrates how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art on both unsupervised ImageNet synthesis, as well as in the conditional setting.
Adversarial Generation of Time-Frequency Features with application in audio synthesis
The potential of deliberate generative TF modeling is demonstrated by training a generative adversarial network (GAN) on short-time Fourier features; by following guidelines for TF representations, the TF-based network outperforms a state-of-the-art GAN that generates waveforms directly, despite the two networks having similar architectures.
Large Scale GAN Training for High Fidelity Natural Image Synthesis
It is found that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input.
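The truncation trick described above can be sketched in a few lines. This is an illustrative resampling variant under stated assumptions (the threshold 0.5, the latent dimensionality, and the function name are not values from the paper): latent components whose magnitude exceeds a threshold are redrawn, reducing the variance of the generator's input and trading sample variety for fidelity.

```python
import numpy as np

def truncated_latents(batch_size, dim, threshold, rng=None):
    """Sample latents from a standard normal, resampling any component
    whose magnitude exceeds `threshold` (the truncation trick).
    Smaller thresholds give higher fidelity but less variety."""
    rng = np.random.default_rng(rng)
    z = rng.standard_normal((batch_size, dim))
    mask = np.abs(z) > threshold
    while mask.any():
        # Redraw only the out-of-range components until all pass
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > threshold
    return z

z = truncated_latents(4, 128, threshold=0.5, rng=0)
print(z.shape)                  # (4, 128)
print(np.abs(z).max() <= 0.5)   # True
```

The generator itself is unchanged; only the sampling distribution at inference time is narrowed, which is why the trick pairs well with the orthogonal regularization noted above.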
Adversarial Autoencoders
This paper shows how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction, and data visualization, with experiments on the MNIST, Street View House Numbers, and Toronto Face datasets.
GANSynth: Adversarial Neural Audio Synthesis
Through extensive empirical investigations on the NSynth dataset, it is demonstrated that GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.
Adversarial Audio Synthesis
WaveGAN is a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio; it can synthesize one-second slices of audio with global coherence, suitable for sound-effect generation.
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization
  • Wei-Ning Hsu, Yu Zhang, James R. Glass
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers.
Adversarial Auto-Encoders for Speech Based Emotion Recognition
The promise of adversarial autoencoders is demonstrated with regard to their ability to encode high-dimensional feature representations of emotional utterances into a compressed space and to regenerate synthetic samples in the original feature space, which can later be used for purposes such as training emotion recognition classifiers.
Improved Techniques for Training GANs
This work focuses on two applications of GANs: semi-supervised learning and the generation of images that humans find visually realistic. It presents ImageNet samples at unprecedented resolution and shows that the proposed methods enable the model to learn recognizable features of ImageNet classes.