A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

  • Ahmed Mustafa, Jan Büthe, Srikanth Korse, Kishan Gupta, Guillaume Fuchs, Nicola Pia
  • Published 9 August 2021
  • Computer Science
  • 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Recently, GAN vocoders have seen rapid progress in speech synthesis, starting to outperform autoregressive models in perceptual quality with much higher generation speed. However, autoregressive vocoders are still the common choice for neural generation of speech signals coded at very low bit rates. In this paper, we present a GAN vocoder which is able to generate wideband speech waveforms from parameters coded at 1.6 kbit/s. The proposed model is a modified version of the StyleMelGAN vocoder… 
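To put the 1.6 kbit/s figure in perspective, the following minimal sketch computes how many bits are available per packet and how this compares with uncompressed audio. The 40 ms packet length and the 16-bit PCM baseline are assumptions chosen for illustration and are not stated in the abstract; the 16 kHz sampling rate is the usual meaning of "wideband". The function names are hypothetical.

```python
def bits_per_frame(bitrate_bps: float, frame_ms: float) -> float:
    """Number of bits available for one frame of `frame_ms` milliseconds."""
    return bitrate_bps * frame_ms / 1000.0


def compression_ratio(bitrate_bps: float,
                      sample_rate_hz: int = 16000,  # wideband speech
                      pcm_bits: int = 16) -> float:  # assumed PCM baseline
    """Ratio of the uncompressed PCM bit rate to the coded bit rate."""
    return sample_rate_hz * pcm_bits / bitrate_bps


if __name__ == "__main__":
    # Assumed 40 ms packets at 1.6 kbit/s.
    print(bits_per_frame(1600, 40))   # 64.0
    print(compression_ratio(1600))    # 160.0
```

Under these assumptions the decoder has to reconstruct 640 waveform samples from 64 bits, which is why a generative vocoder is needed rather than a waveform-matching codec.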


Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity

This work proposes a new GAN vocoder architecture that relies mainly on recurrent and fully-connected networks to generate the time-domain signal directly, frame by frame. This considerably reduces the computational cost and enables very fast generation on both GPUs and low-complexity CPUs.

PostGAN: A GAN-Based Post-Processor to Enhance the Quality of Coded Speech

This work proposes PostGAN, a GAN-based neural post-processor that operates in the sub-band domain and relies on the U-Net architecture and a learned affine transform. It surpasses previously published methods and can improve the quality of coded speech by around 20 MUSHRA points.

NESC: Robust Neural End-2-End Speech Coding with GANs

This work presents Neural End-2-End Speech Codec (NESC), a robust, scalable end-to-end neural speech codec for high-quality wideband speech coding at 3 kbps.

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

A new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations and reconstruct speech waveform is proposed.



StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

StyleMelGAN is a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity, and MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and HiFi-GAN is shown to generalize to mel-spectrogram inversion of unseen speakers and to end-to-end speech synthesis.

DurIAN: Duration Informed Attention Network for Speech Synthesis

It is shown that the proposed DurIAN system can generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while being robust and stable at the same time.

High Fidelity Speech Synthesis with Adversarial Networks

GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator.

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion. The work also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet

It is demonstrated that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate, opening the way for new codec designs based on neural synthesis models.

Multi-Band MelGAN: Faster Waveform Generation For High-Quality Text-To-Speech

This paper improves the original MelGAN by increasing the receptive field of the generator and replacing the feature matching loss with the multi-resolution STFT loss, which better measures the difference between fake and real speech.
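As a rough illustration of what a multi-resolution STFT loss measures, the sketch below compares two waveforms through their magnitude spectrograms at several window sizes and averages the result. It is a minimal NumPy version under stated assumptions, not the implementation from the paper: the (FFT size, hop) pairs, the Hann window, the equal weighting of the spectral-convergence and log-magnitude terms, and all function names are assumptions made for this example.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram (frames x bins) using a Hann window, no padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def stft_loss(real, fake, n_fft, hop, eps=1e-7):
    """Spectral convergence plus log-magnitude L1 distance at one resolution."""
    mag_real = stft_magnitude(real, n_fft, hop)
    mag_fake = stft_magnitude(fake, n_fft, hop)
    # Frobenius norm of the magnitude difference, relative to the real signal.
    spectral_convergence = (np.linalg.norm(mag_real - mag_fake)
                            / (np.linalg.norm(mag_real) + eps))
    log_magnitude = np.mean(np.abs(np.log(mag_real + eps)
                                   - np.log(mag_fake + eps)))
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(real, fake,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average the single-resolution loss over several assumed (n_fft, hop) pairs."""
    return float(np.mean([stft_loss(real, fake, n_fft, hop)
                          for n_fft, hop in resolutions]))
```

Using several resolutions trades off time and frequency detail: short windows penalize smeared transients, while long windows penalize errors in harmonic structure. Both inputs must be at least as long as the largest FFT size.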

LPCNet: Improving Neural Speech Synthesis through Linear Prediction

  • J. Valin, J. Skoglund
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high-quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS, which makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.

High-quality Speech Coding with SampleRNN

We provide a speech coding scheme employing a generative model based on SampleRNN that, while operating at significantly lower bitrates, matches or surpasses the perceptual quality of

Generative Speech Coding with Predictive Variance Regularization

This work introduces predictive-variance regularization to reduce the sensitivity to outliers and provides extensive subjective performance evaluations that show that the system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.