• Corpus ID: 3524525

Efficient Neural Audio Synthesis

@article{Kalchbrenner2018EfficientNA,
  title={Efficient Neural Audio Synthesis},
  author={Nal Kalchbrenner and Erich Elsen and Karen Simonyan and Seb Noury and Norman Casagrande and Edward Lockhart and Florian Stimberg and A{\"a}ron van den Oord and Sander Dieleman and Koray Kavukcuoglu},
  journal={ArXiv},
  year={2018},
  volume={abs/1802.08435}
}
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. [...]

Key Method: We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24kHz 16-bit audio 4x faster than real time on a GPU.
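For context on the dual softmax: the paper factorizes each 16-bit sample into a coarse (high) byte and a fine (low) byte, so the network predicts two 256-way distributions instead of one 65,536-way softmax. A minimal sketch of that split and its inverse (plain Python; the function names are ours, not the paper's):

```python
def split_sample(x16):
    """Split a signed 16-bit sample into coarse and fine 8-bit parts."""
    u = x16 + 2 ** 15              # map [-32768, 32767] -> [0, 65535]
    coarse, fine = divmod(u, 256)  # high byte, low byte, each in [0, 255]
    return coarse, fine

def join_sample(coarse, fine):
    """Recombine the two 8-bit parts into a signed 16-bit sample."""
    return coarse * 256 + fine - 2 ** 15

assert join_sample(*split_sample(-12345)) == -12345
```

At sampling time the network first draws the coarse byte, then conditions the fine distribution on it, per the paper's description.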
Citations

SpeedySpeech: Efficient Neural Speech Synthesis
TLDR
It is shown that self-attention layers are not necessary for generation of high-quality audio, and a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis is proposed, with low requirements on computational resources and fast training time.
SING: Symbol-to-Instrument Neural Generator
TLDR
This work presents a lightweight neural audio synthesizer trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms.
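The loss in question compares log spectrograms of generated and target waveforms. A hedged PyTorch sketch of such a distance (the STFT parameters and epsilon are illustrative placeholders, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def log_spectrogram_loss(pred, target, n_fft=1024, hop=256, eps=1e-6):
    """L1 distance between log-power spectrograms of two waveforms.

    pred, target: (batch, samples) float tensors. The STFT parameters
    and epsilon here are illustrative, not the paper's exact values.
    """
    window = torch.hann_window(n_fft, device=pred.device)

    def log_spec(wav):
        s = torch.stft(wav, n_fft, hop_length=hop, window=window,
                       return_complex=True)
        return torch.log(s.abs() ** 2 + eps)

    return F.l1_loss(log_spec(pred), log_spec(target))
```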
SFNet: A Computationally Efficient Source Filter Model Based Neural Speech Synthesis
  • A. Mv, P. Ghosh
  • Computer Science
    IEEE Signal Processing Letters
  • 2020
TLDR
There is a significant reduction in the memory and computational complexity compared to the state-of-the-art speaker independent neural speech synthesizer without any loss of the naturalness of the synthesized speech.
Multi-Rate Attention Architecture for Fast Streamable Text-to-Speech Spectrum Modeling
TLDR
A multi-rate attention architecture is proposed that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and recurrently generating the attention vector in a streaming manner during decoding, making it ideal for real-time applications.
Audio representations for deep learning in sound synthesis: A review
TLDR
This paper provides an overview of audio representations applied to sound synthesis using deep learning and presents the most significant methods for developing and evaluating a sound synthesis architecture using deep learning models, always depending on the audio representation.
DurIAN: Duration Informed Attention Network for Speech Synthesis
TLDR
It is shown that the proposed DurIAN system could generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while being robust and stable at the same time.
Parallel WaveNet conditioned on VAE latent vectors
TLDR
This work investigates improving the signal quality of a Parallel WaveNet neural vocoder by conditioning it on a sentence-level vector: the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model.
Quasi-fully Convolutional Neural Network with Variational Inference for Speech Synthesis
  • Mu Wang, Xixin Wu, +6 authors H. Meng
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This work introduces a fully convolutional neural network (CNN) model for speech synthesis, which can efficiently run on parallel processors, and shows that CNNs with variational inference can generate highly natural speech on a par with end-to-end models.
High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency
TLDR
Experimental results show that the acoustic model can produce feature sequences with minimal latency, about 31 times faster than real time on a computer CPU and 6.5 times on a mobile CPU, enabling it to meet the conditions required for real-time applications on both devices.
SignalTrain: Profiling Audio Compressors with Deep Neural Networks
TLDR
A data-driven approach is presented for predicting the behavior of a given non-linear audio signal processing effect (henceforth "audio effect"), using a deep auto-encoder model conditioned on both time-domain samples and the control parameters of the target audio effect.

References

Showing 1-10 of 30 references
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
TLDR
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system.
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis and shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Block-Sparse Recurrent Neural Networks
TLDR
Two different approaches to induce block sparsity in RNNs are investigated: pruning blocks of weights in a layer and using group lasso regularization with pruning to create blocks of weights with zeros, which can create block-sparse RNNs with sparsity ranging from 80% to 90% with a small loss in accuracy.
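To make the block-pruning idea concrete, here is a sketch of masking the lowest-magnitude blocks of a weight matrix (the block size, mean-absolute-value scoring, and one-shot application are our illustrative choices; the paper prunes gradually during training and also explores group lasso):

```python
import torch

def block_prune_mask(weight, block=(16, 16), sparsity=0.9):
    """Mask whole blocks of a 2-D weight matrix by block magnitude.

    Blocks are scored by mean absolute value and the lowest `sparsity`
    fraction is zeroed. Block size and scoring are illustrative choices.
    """
    rows, cols = weight.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0
    # View as (rows//br, br, cols//bc, bc) and score each block.
    scores = weight.abs().reshape(rows // br, br, cols // bc, bc).mean(dim=(1, 3))
    k = int(sparsity * scores.numel())
    threshold = scores.flatten().kthvalue(k).values if k > 0 else -1.0
    keep = (scores > threshold).float()
    # Expand the per-block mask back to element resolution.
    return keep.repeat_interleave(br, 0).repeat_interleave(bc, 1)

# Usage (one-shot, for illustration only):
# w.data.mul_(block_prune_mask(w.data))
```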
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
TLDR
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, and presents several key techniques to make the sequence-to-sequence framework perform well for this challenging task.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
TLDR
It is shown that the model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variations in the temporal sequences over very long time spans, on three datasets of different nature.
Neural Machine Translation in Linear Time
TLDR
The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks, and the latent alignment structure contained in the representations reflects the expected alignment between the tokens.
Exploring Sparsity in Recurrent Neural Networks
TLDR
This work proposes a technique to reduce the parameters of a network by pruning weights during the initial training of the network, which reduces the size of the model and can also help achieve significant inference time speed-up using sparse matrix multiply.
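A minimal sketch of the prune-during-training idea, with a sparsity target that ramps up over training steps (the ramp shape and hyper-parameters are assumptions for illustration, not the paper's exact threshold schedule):

```python
import torch

def prune_by_magnitude_(weight, step, start=1000, end=20000, final_sparsity=0.9):
    """In-place magnitude pruning whose target sparsity ramps up with the
    training step. The ramp shape and hyper-parameters are illustrative;
    a real implementation would also keep a persistent mask so that
    pruned weights stay zero on later steps.
    """
    if step < start:
        return
    t = min(1.0, (step - start) / (end - start))
    sparsity = final_sparsity * (1.0 - (1.0 - t) ** 3)  # ramp from 0 to final
    k = int(sparsity * weight.numel())
    if k == 0:
        return
    threshold = weight.abs().flatten().kthvalue(k).values
    weight.masked_fill_(weight.abs() <= threshold, 0.0)

# Usage: call after each optimizer step, e.g. prune_by_magnitude_(w.data, step)
```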
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.