Corpus ID: 14254027

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

@article{Mehri2017SampleRNNAU,
  title={SampleRNN: An Unconditional End-to-End Neural Audio Generation Model},
  author={Soroush Mehri and Kundan Kumar and Ishaan Gulrajani and Rithesh Kumar and Shubham Jain and Jose M. R. Sotelo and Aaron C. Courville and Yoshua Bengio},
  journal={ArXiv},
  year={2017},
  volume={abs/1612.07837}
}
In this paper we propose a novel model for the task of unconditional audio generation that generates one audio sample at a time. [...] Key result: we also show how each component of the model contributes to the exhibited performance.
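The core idea in the abstract, generating audio one sample at a time with each sample conditioned on all previous ones, can be sketched as a minimal autoregressive loop. This is an illustration only: `predict_next` stands in for a trained model, whereas SampleRNN itself uses a hierarchy of recurrent tiers operating at different timescales.

```python
import numpy as np

def generate_autoregressive(predict_next, n_samples, context_len=16, seed=0):
    """Generate a waveform one sample at a time.

    predict_next: callable mapping the last `context_len` samples to the
    next sample value (a stand-in for a trained model -- hypothetical).
    """
    rng = np.random.default_rng(seed)
    # Seed the context with near-silence, then extend sample by sample.
    audio = list(rng.normal(0.0, 0.01, size=context_len))
    for _ in range(n_samples):
        context = np.asarray(audio[-context_len:])
        audio.append(predict_next(context))
    return np.asarray(audio[context_len:])

# Stand-in "model": a damped echo of the context mean (illustration only).
wave = generate_autoregressive(lambda ctx: 0.9 * ctx.mean(), n_samples=100)
print(wave.shape)  # (100,)
```

In the real model the callable would be a neural network emitting a distribution over quantized sample values, from which the next sample is drawn.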
HybridNet: A Hybrid Neural Architecture to Speed-up Autoregressive Models
TLDR
This paper introduces HybridNet, a hybrid neural network that speeds up autoregressive models for raw audio waveform generation and yields state-of-the-art performance when applied to text-to-speech.
MelNet: A Generative Model for Audio in the Frequency Domain
TLDR
This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Multi-speaker Neural Vocoder
TLDR
This dissertation explores an adaptation of the end-to-end model SampleRNN conditioned on both speech parameters and speaker identity, allowing a single shared framework to be implemented in a speech synthesis system.
A general-purpose deep learning approach to model time-varying audio effects
TLDR
This work proposes a deep learning architecture for generic black-box modeling of audio processors with long-term memory based on convolutional and recurrent neural networks and proposes an objective metric based on the psychoacoustics of modulation frequency perception.
SING: Symbol-to-Instrument Neural Generator
TLDR
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
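The spectral loss described in the SING summary, a distance between log spectrograms of generated and target waveforms, can be sketched as follows. This is a minimal illustration: the framing uses a plain unwindowed FFT, and the L1 formulation and parameter values are assumptions, not SING's exact loss.

```python
import numpy as np

def log_spectrogram(x, n_fft=64, hop=16, eps=1e-6):
    """Magnitude log-spectrogram via a plain framed FFT
    (no windowing, kept minimal for illustration)."""
    frames = [x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)]
    mags = np.abs(np.fft.rfft(np.asarray(frames), axis=1))
    return np.log(mags + eps)

def spectral_l1_loss(generated, target, **kw):
    """L1 distance between log spectrograms -- the kind of spectral
    loss the summary describes (exact formulation is an assumption)."""
    return np.mean(np.abs(log_spectrogram(generated, **kw)
                          - log_spectrogram(target, **kw)))

t = np.linspace(0, 1, 1024, endpoint=False)
target = np.sin(2 * np.pi * 110 * t)
assert spectral_l1_loss(target, target) == 0.0
print(spectral_l1_loss(np.zeros_like(target), target) > 0)  # True
```

Comparing spectrograms rather than raw waveforms makes the loss insensitive to phase shifts that are perceptually irrelevant, which is the usual motivation for spectral losses in neural synthesis.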
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Excitation-by-SampleRNN Model for Text-to-Speech
TLDR
A neural vocoder-based text-to-speech (TTS) system that effectively utilizes a source-filter modeling framework, which needs to generate only the glottal movement of the human production mechanism to obtain high-quality speech signals using a small pitch-interval-oriented SampleRNN network.
Catch-A-Waveform: Learning to Generate Audio from a Single Short Example
TLDR
It is illustrated that capturing the essence of an audio source is typically possible from as little as a few tens of seconds of a single training signal, using a GAN-based generative model that can be trained on one short audio signal from any domain and does not require pre-training or any other form of external supervision.
AAT: An Efficient Model for Synthesizing Long Sequences on a Small Dataset
TLDR
Experimental results show that the proposed Adaptive Alignment Tacotron (AAT) model achieves faster convergence and higher stability than the baseline model, opening a feasible approach to speech synthesis for languages with small datasets.

References

SHOWING 1-10 OF 32 REFERENCES
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
TLDR
Advanced recurrent units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU), are evaluated on sequence modeling tasks, and the GRU is found to be comparable to the LSTM.
A Recurrent Latent Variable Model for Sequential Data
TLDR
It is argued that through the use of high-level latent random variables, the variational RNN (VRNN) can model the kind of variability observed in highly structured sequential data such as natural speech.
Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models
TLDR
The recently proposed hierarchical recurrent encoder-decoder neural network is extended to the dialogue domain, and it is demonstrated that this model is competitive with state-of-the-art neural language models and back-off n-gram models.
A Clockwork RNN
TLDR
This paper introduces a simple, yet powerful modification to the simple RNN architecture, the Clockwork RNN (CW-RNN), in which the hidden layer is partitioned into separate modules, each processing inputs at its own temporal granularity, making computations only at its prescribed clock rate.
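The CW-RNN update schedule described above, where each module of the hidden layer updates only at its own clock rate, can be sketched in a few lines. The recurrent update itself is abstracted into a stand-in callable; only the clocked scheduling is shown.

```python
import numpy as np

def clockwork_step(t, hidden, clocks, update_fn):
    """One CW-RNN time step: module i updates only when t is divisible by
    its clock period; otherwise its state is carried over unchanged.
    `update_fn` stands in for the module's recurrent update (hypothetical)."""
    new_hidden = []
    for h, period in zip(hidden, clocks):
        new_hidden.append(update_fn(h, t) if t % period == 0 else h)
    return new_hidden

clocks = [1, 2, 4, 8]  # exponentially spaced periods, as in the paper
hidden = [np.zeros(3) for _ in clocks]

for t in range(8):
    # Stand-in update just increments the state, so the final value of
    # each module counts how many times it was updated.
    hidden = clockwork_step(t, hidden, clocks, lambda h, t: h + 1)

counts = [int(h[0]) for h in hidden]
print(counts)  # [8, 4, 2, 1]: fastest module every step, slowest once
```

Slow modules thus retain long-range context cheaply while fast modules track fine-grained structure, a multi-timescale idea that SampleRNN's tiered architecture also exploits.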
An Empirical Exploration of Recurrent Network Architectures
TLDR
It is found that adding a bias of 1 to the LSTM's forget gate closes the gap between the LSTM and the recently introduced Gated Recurrent Unit (GRU) on some but not all tasks.
Generating Sequences With Recurrent Neural Networks
This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach [...]
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning [...]
Learning Complex, Extended Sequences Using the Principle of History Compression
TLDR
A simple principle for reducing the descriptions of event sequences without loss of information is introduced, and this insight leads to the construction of neural architectures that learn to divide and conquer by recursively decomposing sequences.
Pixel Recurrent Neural Networks
TLDR
A deep neural network is presented that sequentially predicts the pixels in an image along the two spatial dimensions and encodes the complete set of dependencies in the image to achieve log-likelihood scores on natural images that are considerably better than the previous state of the art.