Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

@inproceedings{Yamamoto2020ParallelWA,
  title={Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram},
  author={Ryuichi Yamamoto and Eunwoo Song and Jae-Min Kim},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={6199-6203}
}
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the… 
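As a rough illustration of the multi-resolution spectrogram objective described in the abstract, the sketch below implements a spectral-convergence plus log-STFT-magnitude loss averaged over several STFT configurations in NumPy. This is our own minimal reading, not the authors' code; the (FFT size, hop size, window size) triples match the settings reported in the paper, but the function names and the plain framewise STFT are our assumptions.

```python
import numpy as np

def stft_magnitude(x, fft_size, hop_size, win_size):
    """Magnitude spectrogram via a Hann-windowed, framewise real FFT."""
    window = np.hanning(win_size)
    frames = []
    for start in range(0, len(x) - win_size + 1, hop_size):
        frame = x[start:start + win_size] * window
        frames.append(np.abs(np.fft.rfft(frame, n=fft_size)))
    return np.array(frames)  # shape: (num_frames, fft_size // 2 + 1)

def stft_loss(y_hat, y, fft_size, hop_size, win_size, eps=1e-7):
    """Single-resolution loss: spectral convergence + log-magnitude distance."""
    S_hat = stft_magnitude(y_hat, fft_size, hop_size, win_size)
    S = stft_magnitude(y, fft_size, hop_size, win_size)
    # Spectral convergence: Frobenius distance normalized by the target energy.
    sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + eps)
    # Log STFT magnitude: mean absolute distance between log spectrograms.
    mag = np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))
    return sc + mag

def multi_resolution_stft_loss(y_hat, y,
                               resolutions=((1024, 120, 600),
                                            (2048, 240, 1200),
                                            (512, 50, 240))):
    """Average the single-resolution losses over several STFT configurations."""
    return sum(stft_loss(y_hat, y, *r) for r in resolutions) / len(resolutions)
```

In training, this auxiliary loss would be added to the adversarial loss on the generator; using multiple resolutions prevents the generator from overfitting to a single time-frequency trade-off.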


Parallel Synthesis for Autoregressive Speech Generation
TLDR
Compared with the baseline autoregressive and non-autoregressive models, the proposed model achieves better MOS and shows good generalization when synthesizing 44 kHz speech or utterances from unseen speakers.
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
TLDR
This work proposes DiffSinger, an acoustic model for SVS based on the diffusion probabilistic model, a parameterized Markov chain that iteratively converts the noise into mel-spectrogram conditioned on the music score that outperforms state-of-the-art SVS work.
VocBench: A Neural Vocoder Benchmark for Speech Synthesis
TLDR
VocBench is presented, a framework that benchmarks the performance of state-of-the-art neural vocoders and demonstrates competitive efficacy and quality of the synthesized samples for each vocoder.
High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling
TLDR
The experimental results demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or languages on training data of 300 speakers, covering clean and noisy/reverberant conditions, while allowing real-time low-latency processing on a single core of a ∼2.1–2.7 GHz CPU with a ∼0.57–0.64 real-time factor.
Periodnet: A Non-Autoregressive Waveform Generation Model with a Structure Separating Periodic and Aperiodic Components
TLDR
Experiments using a singing voice corpus show that the proposed structure improves the naturalness of the generated waveform, and that speech waveforms with pitch outside the training data range can be generated more naturally.
Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion
TLDR
Extensive experiments demonstrate that speaker adaptation can achieve higher speaker similarity, and the speaker encoder based conversion model can greatly reduce dysarthric and non-native pronunciation patterns with improved speech intelligibility.
StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
TLDR
StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity, and MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.
Multi-Band Melgan: Faster Waveform Generation For High-Quality Text-To-Speech
TLDR
This paper improves the original MelGAN by increasing the receptive field of the generator and substituting the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech.
The HITSZ TTS system for Blizzard challenge 2020
TLDR
The techniques that were used in HITSZ-TTS 1 entry in Blizzard Challenge 2020 are presented and the evaluation results of subjective listening tests show that the proposed system achieves unsatisfactory performance.
WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU
TLDR
WG-WaveNet is composed of a compact flow-based model and a post-filter, and requires far fewer computational resources than other waveform generation models during both training and inference; even though the model is highly compressed, the post-filter maintains the quality of the generated waveform.

References

Showing 1–10 of 32 references
Probability density distillation with generative adversarial networks for high-quality parallel waveform generation
TLDR
This paper proposes an effective probability density distillation (PDD) algorithm for WaveNet-based parallel waveform generation (PWG) systems that outperform those using conventional approaches, as well as autoregressive generation systems with a well-trained teacher WaveNet.
On the Variance of the Adaptive Learning Rate and Beyond
TLDR
This work identifies a problem of the adaptive learning rate, suggests warmup works as a variance reduction technique, and proposes RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate.
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
A novel feed-forward network based on Transformer is proposed to generate mel-spectrograms in parallel for TTS, which speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x, and is called FastSpeech.
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
TLDR
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, and integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves significant improvement in inference speed, while outperforming a WaveNet in copy-synthesis quality.
Deconvolution and Checkerboard Artifacts
Neural Speech Synthesis with Transformer Network
TLDR
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron2, achieving state-of-the-art performance with quality close to human speech.
Generative Adversarial Network based Speaker Adaptation for High Fidelity WaveNet Vocoder
TLDR
This paper proposes an end-to-end adaptation method based on the generative adversarial network (GAN), which can reduce the computational cost for the training of new speaker adaptation and can further reduce the quality gap between generated and natural waveforms.
ExcitNet Vocoder: A Neural Excitation Model for Parametric Speech Synthesis Systems
TLDR
Experimental results show that the proposed ExcitNet vocoder, trained both speaker-dependently and speaker-independently, outperforms traditional linear prediction vocoders and similarly configured conventional WaveNet vocoders.
Investigation of Enhanced Tacotron Text-to-speech Synthesis Systems with Self-attention for Pitch Accent Language
TLDR
The results reveal that although the proposed systems still do not match the quality of a top-line pipeline system for Japanese, they show important stepping stones towards end-to-end Japanese speech synthesis.
NSML: Meet the MLaaS platform with a real-world case study
TLDR
This work proposed NSML, a machine-learning-as-a-service (MLaaS) platform that lets machine learning jobs be easily launched on an NSML cluster and provides a collaborative environment that supports development at enterprise scale.