Corpus ID: 235446941

Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

@article{Gabrys2021ImprovingTE,
  title={Improving the expressiveness of neural vocoding with non-affine Normalizing Flows},
  author={Adam Gabry's and Yunlong Jiao and V. Klimkov and Daniel Korzekwa and R. Barra-Chicote},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.08649}
}
This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible nonaffine function. The greater expressiveness of the improved PW leads to better-perceived signal quality and naturalness in the waveform reconstruction and text-to-speech (TTS) tasks. We evaluate the model… Expand

Tables from this paper

References

SHOWING 1-10 OF 40 REFERENCES
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
TLDR
The model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion, and suggests a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Expand
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
TLDR
It is demonstrated that modeling periodic patterns of an audio is crucial for enhancing sample quality and the generality of HiFi-GAN is shown to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Expand
Parallel Wavegan: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
TLDR
The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment, which is comparative to the best distillation-based Parallel WaveNet system. Expand
Efficient Neural Audio Synthesis
TLDR
A single-layer recurrent neural network with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model, the WaveRNN, and a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. Expand
FloWaveNet : A Generative Flow for Raw Audio
TLDR
FloWaveNet is proposed, a flow-based generative model for raw audio synthesis that requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow. Expand
Universal Neural Vocoding with Parallel Wavenet
TLDR
It is shown that the proposed universal vocoder significantly outperforms speaker-dependent vocoders overall and performs better than several existing neural vocoder architectures in terms of naturalness and universality when tested on more than 300 open-source voices. Expand
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition. Expand
Waveglow: A Flow-based Generative Network for Speech Synthesis
TLDR
WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Expand
LPCNET: Improving Neural Speech Synthesis through Linear Prediction
  • J. Valin, J. Skoglund
  • Computer Science, Engineering
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
It is demonstrated that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPC net speech synthesis is achievable with a complexity under 3 GFLOPS, which makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones. Expand
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that mapsExpand
...
1
2
3
4
...