Corpus ID: 14254027

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

@article{Mehri2017SampleRNNAU,
  title={SampleRNN: An Unconditional End-to-End Neural Audio Generation Model},
  author={Soroush Mehri and Kundan Kumar and Ishaan Gulrajani and Rithesh Kumar and Shubham Jain and Jose M. R. Sotelo and Aaron C. Courville and Yoshua Bengio},
  journal={ArXiv},
  year={2017},
  volume={abs/1612.07837}
}
In this paper we propose a novel model for the unconditional audio generation task that generates one audio sample at a time. [...] We also show how each component of the model contributes to the exhibited performance.
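
The idea of generating "one audio sample at a time" can be illustrated with a minimal autoregressive sampling loop. The sketch below is not the SampleRNN architecture: `toy_next_sample_probs` is a hypothetical stand-in for a trained network, and the 256 quantization levels and the smoothing width are assumptions chosen only for illustration.

```python
import numpy as np

Q_LEVELS = 256  # assumed number of quantization levels (8-bit audio)


def toy_next_sample_probs(history, q_levels=Q_LEVELS):
    """Hypothetical stand-in for a trained network.

    Returns a categorical distribution over the next quantized sample,
    peaked around the most recent sample so the output varies smoothly.
    """
    levels = np.arange(q_levels)
    logits = -0.5 * ((levels - history[-1]) / 4.0) ** 2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()


def generate(n_samples, seed=0, q_levels=Q_LEVELS):
    """Draw samples one at a time, feeding each back in as context."""
    rng = np.random.default_rng(seed)
    samples = [q_levels // 2]  # start at the midpoint ("silence")
    for _ in range(n_samples):
        probs = toy_next_sample_probs(samples, q_levels)
        samples.append(int(rng.choice(q_levels, p=probs)))
    return np.array(samples[1:])


audio = generate(1000)
```

Each new sample is drawn from a distribution conditioned on the samples generated so far, which is the property the abstract describes. A real model would replace the toy function with a learned predictor.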

MelNet: A Generative Model for Audio in the Frequency Domain
TLDR
This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.
It's Raw! Audio Generation with State-Space Models
TLDR
SaShiMi, a new multi-scale architecture for waveform modeling built around the recently introduced S4 model for long sequence modeling, is proposed, identifying that S4 can be unstable during autoregressive generation, and providing a simple improvement to its parameterization by drawing connections to Hurwitz matrices.
HybridNet: A Hybrid Neural Architecture to Speed-up Autoregressive Models
TLDR
This paper introduces HybridNet, a hybrid neural network that speeds up autoregressive models for raw audio waveform generation and yields state-of-the-art performance when applied to text-to-speech.
GoodBye WaveNet - A Language Model for Raw Audio with Context of 1/2 Million Samples
TLDR
This work proposes a generative auto-regressive architecture that can model audio waveforms over quite a large context, greater than 500,000 samples, on a standard dataset for modeling long-term structure.
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text; its reader combines a bidirectional recurrent neural network encoder with an attention-based recurrent decoder that produces vocoder acoustic features.
Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning
TLDR
This work proposes a method for generating sounds via neural discrete time-frequency representation learning, conditioned on sound classes, which offers an advantage in efficiently modelling long-range dependencies and retaining local fine-grained structures within sound clips.
Multi-speaker Neural Vocoder
TLDR
This dissertation explores an adaptation of the end-to-end model SampleRNN, conditioned on both speech parameters and speaker identity, that allows an entire shared framework to be implemented in a speech synthesis system.
A general-purpose deep learning approach to model time-varying audio effects
TLDR
This work proposes a deep learning architecture for generic black-box modeling of audio processors with long-term memory based on convolutional and recurrent neural networks and proposes an objective metric based on the psychoacoustics of modulation frequency perception.
SING: Symbol-to-Instrument Neural Generator
TLDR
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, is presented; it achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

References

SHOWING 1-10 OF 32 REFERENCES
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
TLDR
Advanced recurrent units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU), are evaluated, and the GRU is found to be comparable to the LSTM.
A Recurrent Latent Variable Model for Sequential Data
TLDR
It is argued that through the use of high-level latent random variables, the variational RNN (VRNN) can model the kind of variability observed in highly structured sequential data such as natural speech.
A Clockwork RNN
TLDR
This paper introduces a simple, yet powerful modification to the simple RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.
An Empirical Exploration of Recurrent Network Architectures
TLDR
It is found that adding a bias of 1 to the LSTM's forget gate closes the gap between the LSTM and the recently introduced Gated Recurrent Unit (GRU) on some but not all tasks.
Generating Sequences With Recurrent Neural Networks
This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach [...]
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning [...]
Learning Complex, Extended Sequences Using the Principle of History Compression
TLDR
A simple principle for reducing the descriptions of event sequences without loss of information is introduced and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.
Pixel Recurrent Neural Networks
TLDR
A deep neural network is presented that sequentially predicts the pixels in an image along the two spatial dimensions and encodes the complete set of dependencies in the image to achieve log-likelihood scores on natural images that are considerably better than the previous state of the art.
Long Short-Term Memory
TLDR
A novel, efficient, gradient-based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.