WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

@article{Wang2022WOLONetWO,
  title={WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis},
  author={Yi Wang and Yi Si},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.09920}
}
Recently, GAN-based neural vocoders such as Parallel WaveGAN [1], MelGAN [2], HiFiGAN [3], and UnivNet [4] have become popular thanks to their lightweight, parallel structures, which enable real-time synthesis of high-fidelity waveforms even on a CPU. HiFiGAN [3] and UnivNet [4] are two state-of-the-art vocoders, yet despite their high quality there is still room for improvement. In this paper, motivated by the structure of Vision Outlooker from computer vision, we adopt a similar idea and propose an effective…
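
The excerpt does not specify WOLONet's block in detail, but the outlook-attention idea it borrows can be illustrated in one dimension. Below is a minimal, hypothetical PyTorch sketch in which every time step predicts softmax weights over its local window and aggregates value projections with them, i.e. a per-position dynamic convolution in the spirit of Vision Outlooker; the module name, single-head form, and kernel size are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Outlooker1d(nn.Module):
    """Simplified 1-D outlook-style attention (illustrative sketch only).

    Each time step predicts softmax weights over its local window and
    uses them to aggregate value projections, so no query-key dot
    products are needed, as in VOLO's outlook attention.
    """
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        assert kernel_size % 2 == 1
        self.k = kernel_size
        self.value = nn.Conv1d(channels, channels, 1)
        self.weight = nn.Conv1d(channels, kernel_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t = x.shape                                  # (B, C, T)
        v = self.value(x)                                  # (B, C, T)
        w = self.weight(x).softmax(dim=1)                  # (B, K, T)
        pad = self.k // 2
        # Gather each position's local window of values: (B, C*K, T)
        win = F.unfold(v.unsqueeze(-1), (self.k, 1), padding=(pad, 0))
        win = win.view(b, c, self.k, t)                    # (B, C, K, T)
        return (win * w.unsqueeze(1)).sum(dim=2)           # (B, C, T)

x = torch.randn(2, 128, 100)
print(Outlooker1d(128)(x).shape)   # torch.Size([2, 128, 100])
```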


References


StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

TLDR
StyleMelGAN is a lightweight neural vocoder that synthesizes high-fidelity speech with low computational complexity; MUSHRA and P.800 listening tests show that it outperforms prior neural vocoders in copy-synthesis and text-to-speech scenarios.
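
The temporal adaptive normalization mentioned above can be sketched as a FiLM-style modulation whose scale and shift vary over time and are predicted from the conditioning features. The layer sizes and nearest-neighbour upsampling below are assumptions for illustration, not StyleMelGAN's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TADELayer(nn.Module):
    """Sketch of temporal adaptive (de)normalization.

    The hidden signal is instance-normalized, then modulated by a
    time-varying scale and shift predicted from conditioning features
    such as an upsampled mel-spectrogram.
    """
    def __init__(self, channels: int, cond_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels)
        self.scale = nn.Conv1d(cond_channels, channels, 3, padding=1)
        self.shift = nn.Conv1d(cond_channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Stretch the conditioning to the hidden signal's time resolution.
        cond = F.interpolate(cond, size=x.shape[-1], mode="nearest")
        return self.norm(x) * self.scale(cond) + self.shift(cond)

x, mel = torch.randn(1, 64, 1024), torch.randn(1, 80, 32)
print(TADELayer(64, 80)(x, mel).shape)   # torch.Size([1, 64, 1024])
```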

FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction

TLDR
FeatherWave, yet another WaveRNN variant, combines multi-band signal processing with linear predictive coding and can significantly improve the efficiency of speech synthesis.
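
The linear-predictive half of that combination is easy to demonstrate: once a few past samples explain most of the waveform, the network only has to model the small residual (excitation). A toy least-squares fit, not FeatherWave's actual procedure:

```python
import torch

# Fit order-p LPC coefficients by least squares on a toy signal, then
# show that the residual the neural net must model is far smaller than
# the signal itself. Signal, order, and noise level are illustrative.
p = 8
n = torch.arange(400, dtype=torch.float32)
s = torch.sin(0.05 * n) + 0.1 * torch.sin(0.3 * n) + 0.001 * torch.randn(400)

# Regression: s[t] ~= sum_i a[i] * s[t - 1 - i]
X = torch.stack([s[p - 1 - i: len(s) - 1 - i] for i in range(p)], dim=1)
y = s[p:]
a = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)

residual = y - X @ a
print(y.abs().max().item(), residual.abs().max().item())  # residual << signal
```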

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

TLDR
The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
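
The non-autoregressive, fully convolutional design can be sketched as a stack of transposed convolutions that upsamples mel frames to waveform rate in a single parallel pass. Channel widths here are guesses and MelGAN's residual blocks are omitted; only the upsampling skeleton is shown.

```python
import torch
import torch.nn as nn

# Minimal MelGAN-style generator skeleton: transposed convolutions
# upsample the mel-spectrogram by a hop length of 256 (= 8*8*2*2) to
# waveform rate in one parallel pass -- no autoregression.
def melgan_like(mel_channels: int = 80) -> nn.Sequential:
    layers, ch = [nn.Conv1d(mel_channels, 512, 7, padding=3)], 512
    for r in (8, 8, 2, 2):                      # total upsampling x256
        layers += [nn.LeakyReLU(0.2),
                   nn.ConvTranspose1d(ch, ch // 2, 2 * r, stride=r, padding=r // 2)]
        ch //= 2
    layers += [nn.LeakyReLU(0.2), nn.Conv1d(ch, 1, 7, padding=3), nn.Tanh()]
    return nn.Sequential(*layers)

mel = torch.randn(1, 80, 50)                    # 50 mel frames
print(melgan_like()(mel).shape)                 # torch.Size([1, 1, 12800])
```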

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

TLDR
It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown through mel-spectrogram inversion for unseen speakers and end-to-end speech synthesis.
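
The periodic-pattern modeling is handled by HiFi-GAN's multi-period discriminators, whose key trick is a reshape: the 1-D waveform is folded into a 2-D (time/period, period) grid so that ordinary 2-D convolutions compare samples exactly one period apart. A sketch of that reshape (the discriminator's conv stack is omitted):

```python
import torch
import torch.nn.functional as F

# Fold a waveform into a 2-D grid per period. The period set
# {2, 3, 5, 7, 11} follows the HiFi-GAN paper.
wav = torch.randn(1, 1, 8192)
for p in (2, 3, 5, 7, 11):
    t = wav.shape[-1]
    padded = F.pad(wav, (0, (-t) % p), mode="reflect")  # pad to multiple of p
    grid = padded.view(1, 1, -1, p)                     # (B, 1, T/p, p)
    print(p, tuple(grid.shape))
```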

WaveFlow: A Compact Flow-based Model for Raw Audio

TLDR
WaveFlow provides a unified view of likelihood-based models for 1-D data, including WaveNet and WaveGlow as special cases, while synthesizing several orders of magnitude faster as it only requires a few sequential steps to generate very long waveforms with hundreds of thousands of time-steps.
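
The "few sequential steps" come from WaveFlow's squeeze operation: the waveform is reshaped into an (h, T/h) grid and the autoregression runs over the h rows only, with each step fully parallel across the columns. A toy sketch with a placeholder row update (the real update is a conditioned flow):

```python
import torch

# Synthesis needs only h sequential network calls (h = 8 or 16 in the
# paper) rather than T; the per-row update below is a stand-in.
T, h = 16000, 16
z = torch.randn(h, T // h)                     # latent noise, squeezed to 2-D

x = torch.zeros(h, T // h)
for i in range(h):                             # only h sequential steps
    prev_rows = x[:i]                          # what the real flow conditions on
    x[i] = z[i] + 0.1 * prev_rows.sum(dim=0)   # placeholder for the flow update
waveform = x.T.reshape(-1)                     # unsqueeze back to length T
print(waveform.shape)                          # torch.Size([16000])
```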

Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram

TLDR
The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real time in a single-GPU environment, which is comparable to the best distillation-based Parallel WaveNet system.
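
The multi-resolution spectrogram objective in the title can be sketched as a sum of spectral-convergence terms at several STFT settings. The resolutions below are common choices and the paper's log-magnitude term is omitted, so treat this as a simplified stand-in rather than the exact loss:

```python
import torch

# Compare spectral magnitudes of generated and target audio at several
# FFT resolutions; this is what lets Parallel WaveGAN train without
# teacher-student distillation.
def mr_stft_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    loss = 0.0
    for n_fft, hop in ((512, 128), (1024, 256), (2048, 512)):
        window = torch.hann_window(n_fft)
        sf = torch.stft(fake, n_fft, hop, window=window, return_complex=True).abs()
        sr = torch.stft(real, n_fft, hop, window=window, return_complex=True).abs()
        loss = loss + torch.norm(sr - sf) / torch.norm(sr)  # spectral convergence
    return loss / 3

real, fake = torch.randn(2, 16000), torch.randn(2, 16000)
print(float(mr_stft_loss(fake, real)))
```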

Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains

TLDR
The proposed Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains, achieved the best mean opinion score (MOS) in most scenarios using ground-truth mel-spectrograms as input, and showed superior performance in unseen domains with regard to speaker, emotion, and language.

WaveGlow: A Flow-based Generative Network for Speech Synthesis

TLDR
WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms; it is implemented as a single network and trained with a single cost function, maximizing the likelihood of the training data, which makes the training procedure simple and stable.
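
The single likelihood objective works because each of WaveGlow's steps is an invertible affine coupling with a cheap log-determinant. A minimal coupling layer of that kind (a 1x1 conv stands in for WaveGlow's conditioned WaveNet-like transform):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Half the channels pass through unchanged and parameterize a
    scale/shift of the other half, so the transform is invertible and
    its log-determinant (needed for the exact likelihood) is sum(log s)."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Conv1d(channels // 2, channels, 1)  # toy transform

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        return torch.cat([xa, xb * log_s.exp() + t], dim=1), log_s.sum()

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        return torch.cat([ya, (yb - t) * (-log_s).exp()], dim=1)

layer = AffineCoupling(8)
x = torch.randn(1, 8, 100)
y, logdet = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-5))  # True
```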

FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

TLDR
FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models; it produces high-quality audio and supports streaming during inference while remaining highly computationally efficient.

DurIAN: Duration Informed Attention Network For Multimodal Synthesis

TLDR
It is shown that DurIAN can generate highly natural speech on par with current state-of-the-art end-to-end systems, while avoiding the word skipping/repeating errors seen in those systems.
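
The duration-informed step at DurIAN's core can be shown in a few lines: each phoneme encoding is repeated for its predicted number of frames, so the decoder consumes a frame-aligned sequence and cannot skip or repeat words. Durations and encodings below are toy values:

```python
import torch

# Expand phoneme-level encodings to frame level using durations.
phoneme_enc = torch.arange(4, dtype=torch.float32).view(4, 1)  # 4 phonemes, dim 1
durations = torch.tensor([2, 3, 1, 4])                         # frames per phoneme
frame_enc = phoneme_enc.repeat_interleave(durations, dim=0)
print(frame_enc.squeeze(1))  # tensor([0., 0., 1., 1., 1., 2., 3., 3., 3., 3.])
```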