Corpus ID: 6254678

WaveNet: A Generative Model for Raw Audio

@article{Oord2016WaveNetAG,
  title={WaveNet: A Generative Model for Raw Audio},
  author={A{\"a}ron van den Oord and Sander Dieleman and Heiga Zen and Karen Simonyan and Oriol Vinyals and Alex Graves and Nal Kalchbrenner and Andrew W. Senior and Koray Kavukcuoglu},
  journal={ArXiv},
  year={2016},
  volume={abs/1609.03499}
}
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. [...] Key result: we also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
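At the core of the model is a stack of dilated causal convolutions: each output sample is a function of past samples only, and doubling the dilation at every layer grows the receptive field exponentially with depth. A minimal NumPy sketch of that building block follows; the two-tap filters and the dilation schedule are illustrative choices, not the paper's configuration.

    import numpy as np

    def causal_dilated_conv(x, weights, dilation):
        # x: (T,) waveform samples; weights: (K,) filter taps.
        # Left-pad so the output at time t depends only on x[<= t].
        K = len(weights)
        pad = (K - 1) * dilation
        xp = np.concatenate([np.zeros(pad), x])
        y = np.empty(len(x))
        for t in range(len(x)):
            # Taps spaced `dilation` samples apart, ending exactly at time t.
            y[t] = xp[t : t + pad + 1 : dilation] @ weights
        return y

    # Dilations 1, 2, 4, 8, ... double the receptive field with every layer,
    # which is how a WaveNet-style stack covers long contexts at sample rate.
    rng = np.random.default_rng(0)
    out = rng.standard_normal(16000)  # one second at 16 kHz, stand-in signal
    for d in (1, 2, 4, 8, 16):
        out = causal_dilated_conv(out, rng.standard_normal(2), dilation=d)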
A Generative Model for Raw Audio Using Transformer Architectures
TLDR
This paper proposes a deep neural network for generating waveforms, similar to WaveNet, and shows how causal Transformer generative models can be used for raw waveform synthesis, i.e. each generated sample depends only on previously observed samples.
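For illustration, the causality the summary refers to can be enforced in a Transformer with a mask over the attention logits, so each position attends only to itself and earlier positions. The NumPy sketch below shows the mechanism for a single head; shapes and names are assumptions, not the paper's model.

    import numpy as np

    def causal_attention(q, k, v):
        # q, k, v: (T, d) query/key/value matrices for a single head.
        T, d = q.shape
        scores = q @ k.T / np.sqrt(d)                     # (T, T) logits
        future = np.triu(np.ones((T, T), dtype=bool), 1)  # strictly above diagonal
        scores[future] = -np.inf                          # position t cannot see t+1...
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                # row-wise softmax
        return w @ v                                      # (T, d) attended values

    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    out = causal_attention(q, k, v)  # out[t] mixes only v[0..t]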
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
TLDR
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale, high-quality dataset of musical notes an order of magnitude larger than comparable public datasets, is introduced.
MelNet: A Generative Model for Audio in the Frequency Domain
TLDR
This work designs a model capable of generating high-fidelity audio samples that capture structure at timescales which time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.
SING: Symbol-to-Instrument Neural Generator
TLDR
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
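A loss of the kind described can be sketched as an L1 distance between log-magnitude spectrograms of the generated and target waveforms. The sketch below uses common framing defaults (512-point FFT, hop of 128), not necessarily SING's exact values.

    import numpy as np

    def log_spectrogram(x, n_fft=512, hop=128, eps=1e-8):
        # Framewise Hann-windowed FFT magnitudes on a log scale.
        window = np.hanning(n_fft)
        frames = [x[s:s + n_fft] * window
                  for s in range(0, len(x) - n_fft + 1, hop)]
        return np.log(np.abs(np.fft.rfft(np.stack(frames), axis=-1)) + eps)

    def spectral_loss(generated, target):
        # L1 distance between log spectrograms: inaudible phase shifts barely
        # move this loss, unlike a sample-wise distance on the raw waveform.
        return np.mean(np.abs(log_spectrogram(generated) - log_spectrogram(target)))

    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(16000), rng.standard_normal(16000)
    print(spectral_loss(a, b))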
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
TLDR
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves significant improvement in inference speed while outperforming WaveNet in copy-synthesis quality.
Vapar Synth - A Variational Parametric Model for Audio Synthesis
TLDR
VaPar Synth, a Variational Parametric Synthesizer, is presented; it utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation, and the model's capabilities are demonstrated via the reconstruction and generation of instrumental tones with flexible control over their pitch.
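For orientation, a conditional VAE over a parametric frame (say, spectral-envelope coefficients) conditioned on pitch can be written compactly as below; the dimensions, hidden sizes, and scalar pitch encoding are illustrative assumptions rather than the paper's design.

    import torch
    import torch.nn as nn

    class CVAE(nn.Module):
        # Encoder and decoder are both conditioned on pitch, so at generation
        # time pitch can be set freely while z supplies the remaining variation.
        def __init__(self, x_dim=60, cond_dim=1, z_dim=8, hidden=128):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim + cond_dim, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, z_dim)
            self.logvar = nn.Linear(hidden, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

        def forward(self, x, pitch):
            h = self.enc(torch.cat([x, pitch], dim=-1))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
            recon = self.dec(torch.cat([z, pitch], dim=-1))
            kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
            return recon, kl  # train on reconstruction error + kl (the ELBO)

    # Generation with pitch control: sample z from the prior, pick a pitch.
    model = CVAE()
    z = torch.randn(1, 8)
    pitch = torch.tensor([[0.44]])  # e.g. 440 Hz on some normalized scale
    frame = model.dec(torch.cat([z, pitch], dim=-1))  # (1, 60) parametric frame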
Parallel WaveNet conditioned on VAE latent vectors
TLDR
The use of a sentence-level conditioning vector, taken from the latent vector of a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model, to improve the signal quality of a Parallel WaveNet neural vocoder is investigated.
SynthNet: Learning to Synthesize Music End-to-End
TLDR
It is concluded that mappings between musical notes and instrument timbre can be learned directly from the raw audio coupled with the musical score, in binary piano-roll format.
Variational Parametric Models for Audio Synthesis
With the advent of data-driven statistical modeling and abundant computing power, researchers are turning increasingly to deep learning for audio synthesis. These methods try to model audio signals [...]
Conditional End-to-End Audio Transforms
TLDR
An end-to-end method for transforming audio from one style to another, based on convolutional and hierarchical recurrent neural networks, is presented; it is designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.

References

Showing 1-10 of 68 references
Speech acoustic modeling from raw multichannel waveforms
TLDR
A convolutional neural network / deep neural network (CNN-DNN) acoustic model is proposed which takes raw multichannel waveforms as input, learns a similar feature representation through supervised training, and outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
A Deep Learning Approach to Data-driven Parameterizations for Statistical Parametric Speech Synthesis
TLDR
This paper creates an invertible, low-dimensional, noise-robust encoding of the mel log spectrum by training a tapered stacked denoising autoencoder (SDA), and investigates a data-driven parameterization technique designed for the specific requirements of synthesis.
Learning the speech front-end with raw waveform CLDNNs
TLDR
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.
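For reference, the log-mel filterbank energies that the raw-waveform front-end is compared against can be computed as follows, here with librosa and typical ASR framing (25 ms windows, 10 ms hop, 40 bands); the test tone and parameter values are stand-ins, not the paper's setup.

    import numpy as np
    import librosa

    sr = 16000
    t = np.arange(sr) / sr
    y = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)  # 1 s tone as a stand-in

    # 25 ms windows (400 samples), 10 ms hop (160), 40 mel bands.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160,
                                         n_mels=40)
    log_mel = np.log(mel + 1e-6)  # (40, n_frames) log filterbank energies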
Acoustic modeling with deep neural networks using raw time signal for LVCSR
TLDR
Inspired by the multi-resolutional analysis layer learned automatically from raw time signal input, the DNN is trained on a combination of multiple short-term features, illustrating how the DNN can learn from the small differences between MFCC, PLP, and Gammatone features.
Statistical parametric speech synthesis using deep neural networks
TLDR
This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by a DNN; experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE
TLDR
A probabilistic neural network model of acoustic trajectories, Trajectory-RNADE, is introduced, able to capture the dependencies between acoustic features conditioned on the phonetic labels in order to obtain high-quality synthetic speech.
Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis
K. Tokuda, H. Zen · 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2015
TLDR
Experimental results show that the proposed approach can directly maximize the likelihood defined in the waveform domain.
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks
TLDR
This paper investigates a novel approach, where the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates, and indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.
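A minimal version of this idea, a 1-D convolutional network mapping raw waveform windows to phoneme class posteriors, might look like the PyTorch sketch below; the layer sizes, window length, and 40-phoneme inventory are assumptions for illustration.

    import torch
    import torch.nn as nn

    class RawWaveformCNN(nn.Module):
        def __init__(self, n_phonemes=40):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=160, stride=80),  # learned "filterbank"
                nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=5),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),                       # pool over time
            )
            self.classifier = nn.Linear(32, n_phonemes)

        def forward(self, wav):                  # wav: (batch, samples)
            h = self.features(wav.unsqueeze(1))  # add channel dim -> (batch, 1, T)
            logits = self.classifier(h.squeeze(-1))
            return logits.log_softmax(dim=-1)    # phoneme class log-posteriors

    model = RawWaveformCNN()
    window = torch.randn(8, 4000)  # eight 250 ms windows at 16 kHz
    log_probs = model(window)      # (8, 40)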
Postfilters to Modify the Modulation Spectrum for Statistical Parametric Speech Synthesis
TLDR
This paper proposes postfilters that modify the modulation spectrum (MS) utterance by utterance or segment by segment, to make the MS of synthetic speech close to that of natural speech; the approach is applicable to various synthesizers based on statistical parametric speech synthesis.
Directly modeling voiced and unvoiced components in speech waveforms by neural networks
K. Tokuda, H. Zen · 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2016
TLDR
A novel acoustic model based on neural networks is proposed for statistical parametric speech synthesis that can generate speech waveforms approximating natural speech waveforms.