Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

@inproceedings{Yamamoto2020ParallelWA,
  title={Parallel {WaveGAN}: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram},
  author={Ryuichi Yamamoto and Eunwoo Song and Jae-Min Kim},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={6199--6203}
}
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the… 
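The multi-resolution spectrogram loss described in the abstract combines a spectral-convergence term and a log-STFT-magnitude term computed at several STFT resolutions. A minimal NumPy sketch of that term is below; the FFT sizes, hop sizes, and window lengths are illustrative defaults, not necessarily the paper's exact configuration, and in training this loss is combined with the adversarial loss.

```python
import numpy as np

def stft_magnitude(x, fft_size, hop_size, win_length):
    """Magnitude STFT of a 1-D signal using a Hann window, zero-padded to fft_size."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop_size
    frames = np.stack([x[i * hop_size : i * hop_size + win_length] * window
                       for i in range(n_frames)])
    # Small epsilon keeps the log-magnitude term finite.
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1)) + 1e-7

def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((512, 128, 512),
                                            (1024, 256, 1024),
                                            (2048, 512, 2048))):
    """Spectral convergence + log-STFT-magnitude loss, averaged over resolutions.

    Each resolution is a (fft_size, hop_size, win_length) triple (illustrative values).
    """
    total = 0.0
    for fft_size, hop_size, win_length in resolutions:
        s_hat = stft_magnitude(y_hat, fft_size, hop_size, win_length)
        s = stft_magnitude(y, fft_size, hop_size, win_length)
        sc_loss = np.linalg.norm(s - s_hat) / np.linalg.norm(s)   # spectral convergence
        mag_loss = np.mean(np.abs(np.log(s) - np.log(s_hat)))     # L1 log-STFT magnitude
        total += sc_loss + mag_loss
    return total / len(resolutions)
```

Comparing a waveform against itself yields zero loss, while any spectral mismatch between generated and reference audio increases it, which is what lets the generator match the time-frequency distribution of real speech.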

Citations

Improved Parallel Wavegan Vocoder with Perceptually Weighted Spectrogram Loss
A spectral-domain perceptual weighting technique for Parallel WaveGAN-based text-to-speech (TTS) systems that penalizes perceptually-sensitive errors in the frequency domain and is optimized toward reducing auditory noise in the synthesized speech.
TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis.
TFGAN is a novel vocoder model that is adversarially trained in both the time and frequency domains, and achieves a mean opinion score (MOS) comparable to that of an autoregressive vocoder in a speech synthesis context.
Parallel Waveform Synthesis Based on Generative Adversarial Networks with Voicing-Aware Conditional Discriminators
This framework adopts a projection-based conditioning method that can significantly improve the discriminator’s performance, and separates the conventional discriminator into two waveform discriminators for modeling voiced and unvoiced speech.
LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks
A lightweight end-to-end text-to-speech model is proposed that generates high-quality speech at high speed and jointly trains the prosodic embedding network with the waveform generation task using an effective domain-transfer technique.
StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity, and MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and HiFi-GAN is shown to generalize to mel-spectrogram inversion of unseen speakers and to end-to-end speech synthesis.
Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains
Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains, achieved the best mean opinion score (MOS) in most scenarios using ground-truth mel-spectrograms as input, and showed superior performance in unseen domains with regard to speaker, emotion, and language.
A Spectral Energy Distance for Parallel Speech Synthesis
This work proposes a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function, based on a generalized energy distance between the distributions of the generated and real audio.
WaveFlow: A Compact Flow-based Model for Raw Audio
WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps.
Avocodo: Generative Adversarial Network for Artifact-free Vocoder
This paper proposes a GAN-based neural vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts, and introduces two kinds of discriminators to evaluate waveforms from various perspectives: a collaborative multi-band discriminator and a sub-band discriminator.
…

References

Showing 1-10 of 32 references
Probability density distillation with generative adversarial networks for high-quality parallel waveform generation
This paper proposes an effective probability density distillation (PDD) algorithm for WaveNet-based parallel waveform generation (PWG) systems that outperform both those using conventional approaches, and also autoregressive generation systems with a well-trained teacher WaveNet.
Generative Adversarial Network based Speaker Adaptation for High Fidelity WaveNet Vocoder
This paper proposes an end-to-end adaptation method based on the generative adversarial network (GAN), which reduces the computational cost of training for new-speaker adaptation and further narrows the quality gap between generated and natural waveforms.
Generative Adversarial Network-Based Glottal Waveform Model for Statistical Parametric Speech Synthesis
A new method for predicting glottal waveforms by generative adversarial networks (GANs) is proposed, and the newly proposed GANs achieve synthesis quality comparable to that of widely-used DNNs, without using an additive noise component.
SEGAN: Speech Enhancement Generative Adversarial Network
This work proposes generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; 28 speakers and 40 different noise conditions are incorporated into the same model, such that model parameters are shared across them.
A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis
This paper builds a framework in which new vocoding and acoustic modeling techniques are compared with conventional approaches by means of a large-scale crowdsourced evaluation, showing that generative adversarial networks and an autoregressive (AR) model outperform a normal recurrent network, with the AR model performing best.
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.
Adversarial Audio Synthesis
WaveGAN is a first attempt at applying GANs to unsupervised synthesis of raw-waveform audio; it synthesizes one-second slices of audio with global coherence, suitable for sound-effect generation.
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
The first text-to-wave neural architecture for speech synthesis is introduced; it is fully convolutional, enables fast end-to-end training from scratch, and significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet.
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
…