Corpus ID: 3697399

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

@inproceedings{Engel2017NeuralAS,
  title={Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders},
  author={Jesse Engel and Cinjon Resnick and Adam Roberts and Sander Dieleman and Mohammad Norouzi and Douglas Eck and Karen Simonyan},
  booktitle={ICML},
  year={2017}
}
Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. […] Second, we introduce NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets. Using NSynth, we demonstrate improved qualitative and quantitative performance of the WaveNet autoencoder over a well-tuned spectral autoencoder baseline. Finally, we show that the model learns a manifold…
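The core building block of a WaveNet-style model like the one described above is the dilated causal convolution. The following is a minimal sketch under stated assumptions: the function name, toy kernel weights, and signal are illustrative only, and a real WaveNet adds gated activations, residual/skip connections, and learned filters.

```python
def causal_dilated_conv(x, w, dilation):
    """One layer: y[t] = sum_k w[k] * x[t - k*dilation].
    Causal: the output at time t never sees samples after t."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += wk * x[idx]
        out.append(acc)
    return out

# Stacking layers with dilations 1, 2, 4, ... roughly doubles the
# receptive field per layer, which is how WaveNet-style models cover
# long audio contexts with relatively few layers.
signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
h = signal
for d in (1, 2, 4):
    h = causal_dilated_conv(h, [0.5, 0.5], dilation=d)
```

With a kernel of size 2 and dilations 1, 2, 4, the receptive field grows to 8 samples; production models use many more layers and repeated dilation cycles.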


SynthNet: Learning to Synthesize Music End-to-End
TLDR
It is concluded that mappings between musical notes and the instrument timbre can be learned directly from the raw audio coupled with the musical score, in binary piano roll format.
VaPar Synth - A Variational Parametric Model for Audio Synthesis
TLDR
VaPar Synth, a Variational Parametric Synthesizer, is presented; it utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation, and demonstrates the model’s capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
Variational Parametric Models for Audio Synthesis
TLDR
This work presents VaPar Synth a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder trained on a suitable parametric representation and investigates a parametric model for violin tones, in particular the generative modeling of the residual bow noise.
Deep Performer: Score-to-Audio Music Performance Synthesis
TLDR
The Deep Performer is presented—a novel system for score-to-audio music performance synthesis that can synthesize music with clear polyphony and harmonic structures and significantly outperforms the baseline on an existing piano dataset in overall quality.
RAVE: A variational autoencoder for fast and high-quality neural audio synthesis
TLDR
This paper introduces a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis and introduces a novel two-stage training procedure, namely representation learning and adversarial fine-tuning.
Autoencoders for music sound synthesis: a comparison of linear, shallow, deep and variational models
TLDR
It is shown that PCA systematically outperforms shallow AE and that only a deep architecture (DAE) can lead to a lower reconstruction error, and that VAEs are still able to outperform PCA while providing a low-dimensional latent space with nice "usability" properties.
SING: Symbol-to-Instrument Neural Generator
TLDR
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
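The log-spectrogram loss described in the SING summary above can be sketched as follows. This is a hedged toy version: it uses a naive whole-signal DFT, whereas real systems compute framed (and often multi-resolution) STFTs; all names here are illustrative.

```python
import cmath
import math

def log_mag_spectrum(x, eps=1e-5):
    """Naive O(n^2) DFT magnitude, then log. The eps term keeps
    log() finite on silent bins."""
    n = len(x)
    return [math.log(eps + abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                                   for t in range(n))))
            for k in range(n // 2 + 1)]

def log_spectrogram_l1(a, b):
    """L1 distance between log-magnitude spectra of two waveforms."""
    return sum(abs(p - q) for p, q in zip(log_mag_spectrum(a),
                                          log_mag_spectrum(b)))

# Toy check: a signal is spectrally identical to itself, while a pure
# tone differs from silence.
tone = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
silence = [0.0] * 32
```

Comparing log magnitudes rather than waveforms makes the loss insensitive to phase, which is one reason such losses train stably on audio.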
Neural Music Synthesis for Flexible Timbre Control
TLDR
A neural music synthesis model with flexible timbre controls, which consists of a recurrent neural network conditioned on a learned instrument embedding followed by a WaveNet vocoder, is described.
Assisted Sound Sample Generation with Musical Conditioning in Adversarial Auto-Encoders
TLDR
The proposed model generates notes as magnitude spectrograms from any probabilistic latent code samples, with expressive control of orchestral timbres and playing styles, and can be applied to other sound domains, including a user's libraries with custom sound tags that could be mapped to specific generative controls.
HpRNet: Incorporating Residual Noise Modeling for Violin in a Variational Parametric Synthesizer
TLDR
This work investigates a parametric model for violin tones, in particular the generative modeling of the residual bow noise to make for more natural tone quality.

References

Showing 1-10 of 42 references
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Musical Audio Synthesis Using Autoencoding Neural Nets
TLDR
An interactive musical audio synthesis system that uses feedforward artificial neural networks for synthesis, rather than for discriminative or regression tasks, and allows one to interact directly with the parameters of the model and generate musical audio in real time.
Learning Features of Music from Scratch
TLDR
A multi-label classification task to predict notes in musical recordings is defined, along with an evaluation protocol, and several machine learning architectures for this task are benchmarked.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
TLDR
It is shown that the model, which combines memory-less modules (autoregressive multilayer perceptrons) with stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different nature.
Variational Lossy Autoencoder
TLDR
This paper presents a simple but principled method to learn global representations by combining a Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE and PixelRNN/CNN, which greatly improves the generative modeling performance of VAEs.
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
TLDR
This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
Sound texture synthesis via filter statistics
TLDR
The results suggest that such statistical representations could underlie sound texture perception, and that the auditory system may use fairly simple statistics to recognize many natural sound textures.
PixelVAE: A Latent Variable Model for Natural Images
Natural image modeling is a landmark challenge of unsupervised learning. Variational Autoencoders (VAEs) learn a useful latent representation and model global structure well but have difficulty…
A note on the evaluation of generative models
TLDR
This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models, with a focus on image models, and shows that three of the currently most commonly used criteria (average log-likelihood, Parzen window estimates, and visual fidelity of samples) are largely independent of each other when the data is high-dimensional.
The Synthesis of Complex Audio Spectra by Means of Frequency Modulation
A new application of the well-known process of frequency modulation is shown to result in a surprising control of audio spectra. The technique provides a means of great simplicity to control the…
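Chowning's FM technique referenced above reduces to a single equation, y(t) = A·sin(2πf_c·t + I·sin(2πf_m·t)), where the modulation index I controls sideband richness. A minimal sketch, assuming one carrier/modulator pair; the function name and default parameters are illustrative, not from the paper.

```python
import math

def fm_tone(f_c, f_m, index, dur=0.5, sr=16000, amp=0.8):
    """Two-oscillator FM synthesis (Chowning):
    y(t) = amp * sin(2*pi*f_c*t + index * sin(2*pi*f_m*t)).
    Returns dur*sr samples in [-amp, amp]."""
    n = int(dur * sr)
    return [amp * math.sin(2 * math.pi * f_c * t / sr
                           + index * math.sin(2 * math.pi * f_m * t / sr))
            for t in range(n)]

# A harmonic spectrum arises when f_c/f_m is a simple integer ratio,
# e.g. a 440 Hz carrier modulated at 220 Hz.
tone = fm_tone(f_c=440.0, f_m=220.0, index=2.0)
```

Sweeping `index` over time is the classic way to get evolving, brass-like timbres from this single pair of oscillators.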