Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

@inproceedings{Yamamoto2020ParallelWA,
  title={Parallel {WaveGAN}: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram},
  author={Ryuichi Yamamoto and Eunwoo Song and Jae-Min Kim},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={6199--6203}
}
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the… 
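The multi-resolution spectrogram objective described in the abstract combines a spectral-convergence term and a log-STFT-magnitude term, averaged over several STFT configurations. A minimal NumPy sketch of that loss is below; the resolution triples and the `eps` constant are illustrative assumptions, not the paper's exact FFT/hop/window settings, and the Hann-window framing stands in for a full framework implementation.

```python
import numpy as np

def stft_magnitude(x, fft_size, hop, win_length):
    """Magnitude STFT via Hann-windowed framing and a real FFT."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i * hop : i * hop + win_length] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=-1))

def multi_resolution_stft_loss(y, y_hat,
                               resolutions=((512, 128, 512),
                                            (1024, 256, 1024),
                                            (2048, 512, 2048))):
    """Average spectral-convergence + log-magnitude L1 loss over
    several (fft_size, hop, win_length) resolutions."""
    eps = 1e-7
    total = 0.0
    for fft_size, hop, win in resolutions:
        S = stft_magnitude(y, fft_size, hop, win)
        S_hat = stft_magnitude(y_hat, fft_size, hop, win)
        # Spectral convergence: relative Frobenius-norm error.
        sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + eps)
        # L1 distance between log magnitudes.
        mag = np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))
        total += sc + mag
    return total / len(resolutions)
```

In the proposed method this term is optimized jointly with the adversarial loss; using multiple resolutions prevents the generator from overfitting the time-frequency trade-off of any single STFT configuration.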

Citations

DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation

TLDR
The evaluation shows that SawSing converges much faster and outperforms state-of-the-art generative adversarial network- and diffusion-based vocoders in a resource-limited scenario with only 3 training recordings and a 3-hour training time.

Parallel Synthesis for Autoregressive Speech Generation

TLDR
Compared with baseline autoregressive and non-autoregressive models, the proposed model achieves better MOS and shows good generalization when synthesizing 44 kHz speech or utterances from unseen speakers.

Vocbench: A Neural Vocoder Benchmark for Speech Synthesis

TLDR
VocBench is presented, a framework that benchmarks the performance of state-of-the-art neural vocoders and demonstrates competitive efficacy and quality of the synthesized samples for each vocoder.

High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling

TLDR
The experimental results demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or languages when trained on data from 300 speakers covering clean and noisy/reverberant conditions, while allowing real-time low-latency processing on a single core of a ∼2.1–2.7 GHz CPU with a ∼0.57–0.64 real-time factor.

Periodnet: A Non-Autoregressive Waveform Generation Model with a Structure Separating Periodic and Aperiodic Components

TLDR
Experiments using a singing voice corpus show that the proposed structure improves the naturalness of the generated waveform, and that speech waveforms with pitches outside the training data range can be generated with greater naturalness.

Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion

TLDR
Extensive experiments demonstrate that speaker adaptation can achieve higher speaker similarity, and the speaker encoder based conversion model can greatly reduce dysarthric and non-native pronunciation patterns with improved speech intelligibility.

StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

TLDR
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity, and MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.

Multi-Band Melgan: Faster Waveform Generation For High-Quality Text-To-Speech

TLDR
This paper improves the original MelGAN by increasing the receptive field of the generator and substituting the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech.

The HITSZ TTS system for Blizzard challenge 2020

TLDR
The techniques that were used in HITSZ-TTS 1 entry in Blizzard Challenge 2020 are presented and the evaluation results of subjective listening tests show that the proposed system achieves unsatisfactory performance.

WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU

TLDR
WG-WaveNet is composed of a compact flow-based model and a post-filter that requires far fewer computational resources than other waveform generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveform.
...

References

Showing 1–10 of 31 references

Probability density distillation with generative adversarial networks for high-quality parallel waveform generation

TLDR
This paper proposes an effective probability density distillation (PDD) algorithm for WaveNet-based parallel waveform generation (PWG) systems that outperforms both systems using conventional PDD approaches and autoregressive generation systems with a well-trained teacher WaveNet.

On the Variance of the Adaptive Learning Rate and Beyond

TLDR
This work identifies a problem with the adaptive learning rate, suggests that warmup works as a variance reduction technique, and proposes RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate.

FastSpeech: Fast, Robust and Controllable Text to Speech

TLDR
A novel feed-forward network based on Transformer is proposed to generate mel-spectrograms in parallel for TTS, which speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x, and is called FastSpeech.

GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram

TLDR
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves a significant improvement in inference speed while outperforming a WaveNet in copy-synthesis quality.

Deconvolution and Checkerboard Artifacts

Neural Speech Synthesis with Transformer Network

TLDR
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron2, and achieves state-of-the-art performance close to human quality.

Generative Adversarial Network based Speaker Adaptation for High Fidelity WaveNet Vocoder

TLDR
This paper proposes an end-to-end adaptation method based on the generative adversarial network (GAN), which can reduce the computational cost for the training of new speaker adaptation and can further reduce the quality gap between generated and natural waveforms.

ExcitNet Vocoder: A Neural Excitation Model for Parametric Speech Synthesis Systems

TLDR
Experimental results show that the proposed ExcitNet vocoder, trained both speaker-dependently and speaker-independently, outperforms traditional linear prediction vocoders and similarly configured conventional WaveNet vocoders.

Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language

TLDR
The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they show important stepping stones towards end-to-end Japanese speech synthesis.

NSML: Meet the MLaaS platform with a real-world case study

TLDR
This work proposes NSML, a machine-learning-as-a-service (MLaaS) platform that makes machine learning jobs easy to launch on an NSML cluster and provides a collaborative environment that supports development at enterprise scale.