Corpus ID: 3697399

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

@inproceedings{engel2017neural,
  title={Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders},
  author={Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Mohammad Norouzi and Douglas Eck and Karen Simonyan},
  booktitle={International Conference on Machine Learning},
  year={2017}
}
Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. […] Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold…
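The WaveNet autoencoder named in the abstract operates directly on raw waveforms, built from stacks of dilated causal convolutions. As an illustrative sketch only (not the paper's implementation; all function names here are hypothetical), the core building block and its exponentially growing receptive field can be shown in a few lines:

```python
# Hypothetical sketch of a WaveNet-style building block: a dilated causal
# convolution. Stacking these with dilations 1, 2, 4, ... grows the context
# window exponentially with depth while each layer stays cheap.

def dilated_causal_conv(x, weights, dilation):
    """1-D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            tap = t - i * dilation                     # look back i*dilation samples
            acc += w * (x[tap] if tap >= 0 else 0.0)   # zero-pad the past
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples seen by a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# A stack with kernel size 2 and dilations 1, 2, 4, ..., 512 sees 1024 samples.
print(receptive_field(2, [2 ** i for i in range(10)]))  # → 1024
```

This is why such models can condition on long stretches of audio at 16 kHz without impractically deep networks.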

SynthNet: Learning to Synthesize Music End-to-End

It is concluded that mappings between musical notes and instrument timbre can be learned directly from raw audio coupled with the musical score in binary piano-roll format.

Vapar Synth - A Variational Parametric Model for Audio Synthesis

VaPar Synth, a Variational Parametric Synthesizer, utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation; the model's capabilities are demonstrated through the reconstruction and generation of instrumental tones with flexible control over their pitch.

Variational Parametric Models for Audio Synthesis

This work presents VaPar Synth, a Variational Parametric Synthesizer that utilizes a conditional variational autoencoder trained on a suitable parametric representation, and investigates a parametric model for violin tones, in particular the generative modeling of the residual bow noise.

Deep Performer: Score-to-Audio Music Performance Synthesis

The Deep Performer, a novel system for score-to-audio music performance synthesis, can synthesize music with clear polyphony and harmonic structure and significantly outperforms the baseline on an existing piano dataset in overall quality.

Multi-instrument Music Synthesis with Spectrogram Diffusion

This work compares training the decoder as an autoregressive model and as a Denoising Diffusion Probabilistic Model (DDPM) and finds that the DDPM approach is superior both qualitatively and as measured by audio reconstruction and Fréchet distance metrics.

RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

This paper introduces the Realtime Audio Variational autoEncoder (RAVE), which allows both fast and high-quality audio waveform synthesis, along with a novel two-stage training procedure: representation learning followed by adversarial fine-tuning.

Autoencoders for music sound synthesis: a comparison of linear, shallow, deep and variational models

It is shown that PCA systematically outperforms shallow autoencoders, that only a deep architecture (DAE) leads to a lower reconstruction error, and that VAEs still outperform PCA while providing a low-dimensional latent space with desirable "usability" properties.
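The PCA-versus-autoencoder comparison above rests on a standard fact: under mean-squared error, PCA is the optimal *linear* encoder-decoder, so a shallow linear autoencoder cannot beat it. A minimal pure-Python sketch (2-D input, 1-D latent code; illustrative only, not the paper's experimental setup) makes the encode/decode round trip concrete:

```python
import math

def pca_1d_reconstruction_error(points):
    """Project 2-D points onto their first principal axis (a 1-D latent code)
    and return the mean squared reconstruction error. Pure Python, 2-D only."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    # 2x2 covariance matrix entries
    sxx = sum(x * x for x in xs) / n
    syy = sum(y * y for y in ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    # orientation of the leading eigenvector of [[sxx, sxy], [sxy, syy]]
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    err = 0.0
    for x, y in zip(xs, ys):
        t = x * ux + y * uy        # encode: 1-D latent coordinate
        rx, ry = t * ux, t * uy    # decode: back to 2-D
        err += (x - rx) ** 2 + (y - ry) ** 2
    return err / n
```

Data lying exactly on a line reconstructs with zero error; anything off-axis leaves an irreducible residual, which is the quantity the cited comparison measures for each model class.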

Generating Detailed Music Datasets with Neural Audio Synthesis

This work combines a generative model of MIDI (Coconet, trained on Bach chorales) with a structured audio synthesis model (MIDI-DDSP, trained on URMP) into a system capable of producing unlimited amounts of realistic chorales with rich annotations through controlled synthesis.

SING: Symbol-to-Instrument Neural Generator

This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
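The loss described for SING, a distance between log spectrograms of generated and target waveforms, can be sketched in pure Python. This is an illustrative approximation, not SING's exact objective (which uses specific STFT parameters and an FFT); the naive DFT below is only for clarity:

```python
import cmath
import math

def log_spectrogram(signal, frame=64, hop=32, eps=1e-5):
    """Magnitude log-spectrogram via a naive DFT (real systems use an FFT)."""
    frames = []
    for start in range(0, len(signal) - frame + 1, hop):
        window = signal[start:start + frame]
        mags = []
        for k in range(frame // 2 + 1):  # non-negative frequency bins
            z = sum(x * cmath.exp(-2j * math.pi * k * n / frame)
                    for n, x in enumerate(window))
            mags.append(math.log(abs(z) + eps))  # eps keeps the log finite
        frames.append(mags)
    return frames

def spectral_l1_loss(a, b):
    """Mean absolute distance between two log-spectrograms of equal shape."""
    total = count = 0
    for fa, fb in zip(log_spectrogram(a), log_spectrogram(b)):
        for xa, xb in zip(fa, fb):
            total += abs(xa - xb)
            count += 1
    return total / count
```

Because the loss compares magnitudes only, two waveforms that differ in phase but sound alike incur little penalty, which is the motivation for spectral losses over raw sample-wise distances.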

Neural Music Synthesis for Flexible Timbre Control

A neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder, is described.

Musical Audio Synthesis Using Autoencoding Neural Nets

An interactive musical audio synthesis system that uses feedforward artificial neural networks for synthesis rather than for discriminative or regression tasks, allowing one to interact directly with the parameters of the model and generate musical audio in real time.

Learning Features of Music from Scratch

A multi-label classification task to predict notes in musical recordings is defined, along with an evaluation protocol, and several machine learning architectures for this task are benchmarked.

Variational Lossy Autoencoder

This paper presents a simple but principled method to learn global representations by combining a Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE, and PixelRNN/CNN, greatly improving the generative modeling performance of VAEs.

Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
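The denoising criterion that this entry refers to is easy to state in code: corrupt the input, but score the reconstruction against the *clean* input. A minimal sketch under assumed names (`corrupt`, `denoising_loss`, and the `reconstruct` callable are all illustrative, not the paper's API):

```python
import random

def corrupt(x, p=0.3, rng=random):
    """Masking noise: zero each input dimension independently with probability p."""
    return [0.0 if rng.random() < p else v for v in x]

def denoising_loss(reconstruct, x, p=0.3, rng=random):
    """Denoising criterion: the model only sees the corrupted input x_tilde,
    yet is penalized against the clean x, so it must exploit statistical
    dependencies between dimensions to fill in what was destroyed."""
    x_tilde = corrupt(x, p, rng)
    x_hat = reconstruct(x_tilde)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

With p = 0 this reduces to the ordinary autoencoder objective, which an identity map solves trivially; with p > 0 the identity map fails, which is exactly why the criterion forces useful representations.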

Sound texture synthesis via filter statistics

The results suggest that such statistical representations could underlie sound texture perception, and that the auditory system may use fairly simple statistics to recognize many natural sound textures.
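The "fairly simple statistics" this summary mentions are marginal moments of subband amplitude envelopes. A rough pure-Python illustration of that kind of measurement (a sketch of the idea only; the paper's actual model uses a cochlear filter bank and a richer statistic set):

```python
import math

def moments(x):
    """Mean, variance, and skewness of a sequence: the kind of simple marginal
    statistics proposed to underlie sound texture perception."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    std = math.sqrt(var)
    skew = (sum((v - mean) ** 3 for v in x) / n) / (std ** 3) if std else 0.0
    return mean, var, skew

def envelope(x, smooth=8):
    """Crude amplitude envelope: rectify, then moving-average over `smooth` samples."""
    rect = [abs(v) for v in x]
    return [sum(rect[max(0, i - smooth + 1):i + 1]) / min(smooth, i + 1)
            for i in range(len(rect))]
```

Texture synthesis in this framework then amounts to shaping noise until its subband envelope statistics match those measured from the target recording.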

PixelVAE: A Latent Variable Model for Natural Images

Natural image modeling is a landmark challenge of unsupervised learning. Variational Autoencoders (VAEs) learn a useful latent representation and model global structure well but have difficulty…

A note on the evaluation of generative models

This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models with a focus on image models and shows that three of the currently most commonly used criteria (average log-likelihood, Parzen window estimates, and visual fidelity of samples) are largely independent of each other when the data is high-dimensional.

Sampling Generative Networks: Notes on a Few Effective Techniques

Several techniques for effectively sampling and visualizing the latent spaces of generative models are introduced, and two new techniques for deriving attribute vectors are demonstrated: bias-corrected vectors with data replication and synthetic vectors with data augmentation.
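Two of the techniques this entry alludes to are spherical interpolation between latent codes and mean-difference attribute vectors. A hedged pure-Python sketch (function names are illustrative; the attribute-vector version below is the plain mean difference, without the paper's bias correction):

```python
import math

def slerp(p0, p1, t):
    """Spherical linear interpolation between two latent vectors. For Gaussian
    latent spaces this is preferred over a straight line, which cuts through
    low-probability regions near the origin."""
    dot = sum(a * b for a, b in zip(p0, p1))
    n0 = math.sqrt(sum(a * a for a in p0))
    n1 = math.sqrt(sum(b * b for b in p1))
    omega = math.acos(max(-1.0, min(1.0, dot / (n0 * n1))))
    if omega < 1e-8:  # nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(p0, p1)]
    so = math.sin(omega)
    return [math.sin((1 - t) * omega) / so * a + math.sin(t * omega) / so * b
            for a, b in zip(p0, p1)]

def attribute_vector(with_attr, without_attr):
    """Mean-difference attribute vector between two sets of latent codes;
    adding it to a code should add the attribute to the decoded sample."""
    n, m = len(with_attr), len(without_attr)
    dim = len(with_attr[0])
    return [sum(v[i] for v in with_attr) / n -
            sum(v[i] for v in without_attr) / m for i in range(dim)]
```

In the NSynth context the same operations apply to note embeddings, e.g. interpolating between the codes of two instruments before decoding.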

Improved Techniques for Training GANs

This work focuses on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic, and presents ImageNet samples with unprecedented resolution and shows that the methods enable the model to learn recognizable features of ImageNet classes.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.