Vocbench: A Neural Vocoder Benchmark for Speech Synthesis

@article{AlBadawy2021VocbenchAN,
  title={Vocbench: A Neural Vocoder Benchmark for Speech Synthesis},
  author={Ehab A. AlBadawy and Andrew Gibiansky and Qing He and Jilong Wu and Ming-Ching Chang and Siwei Lyu},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022},
  pages={881-885}
}
Neural vocoders, which convert spectral representations of an audio signal into waveforms, are a common component of speech synthesis pipelines. A vocoder focuses on synthesizing waveforms from a low-dimensional representation, such as a mel-spectrogram. In recent years, different approaches have been introduced to develop such vocoders. However, it has become more challenging to assess these new vocoders and compare their performance to previous ones. To address this problem, we present… 
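As a concrete illustration of the vocoding task, the minimal sketch below computes an 80-band mel-spectrogram with librosa and inverts it back to a waveform. Griffin-Lim is used here only as a classical, phase-reconstruction stand-in for a neural vocoder; it is not one of the benchmarked models, and the parameter choices are illustrative.

```python
# Vocoding in miniature: waveform -> mel-spectrogram -> waveform.
import librosa

# Load a short example clip that ships with librosa.
wav, sr = librosa.load(librosa.example("trumpet"), sr=22050)

# The low-dimensional spectral representation: an 80-band mel-spectrogram.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# A vocoder's job is mel -> waveform; mel_to_audio approximates it with
# Griffin-Lim phase reconstruction instead of a neural model.
wav_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256
)
print(wav.shape, wav_hat.shape)
```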

Citations

BigVGAN: A Universal Neural Vocoder with Large-Scale Training

This work presents BigVGAN, a universal vocoder that generalizes well to various unseen conditions in a zero-shot setting. It introduces periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality.
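As a rough sketch of the periodic nonlinearity this line of work builds on, the snippet below implements a Snake-style activation, snake(x) = x + (1/a)·sin²(a·x), with a learnable per-channel frequency a. This shows only the activation's form; BigVGAN's anti-aliased filtering around it is not reproduced here.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake-style periodic activation: x + (1/a) * sin(a * x)**2."""
    def __init__(self, channels: int, alpha_init: float = 1.0):
        super().__init__()
        # One learnable frequency per channel, broadcast over (B, C, T).
        self.alpha = nn.Parameter(alpha_init * torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Small epsilon guards the division as alpha approaches zero.
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

x = torch.randn(2, 80, 100)  # (batch, channels, frames)
print(Snake(80)(x).shape)    # torch.Size([2, 80, 100])
```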

References

Showing 1-10 of 33 references

A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

The obtained results suggest that the choice of voice has a profound impact on the overall quality of vocoder-generated speech, and that the best vocoder can vary from voice to voice, indicating that the waveform generation method of a vocoder is essential for quality improvements.

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion. The paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
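A minimal sketch of this generator pattern: transposed convolutions upsample the mel frame rate to the audio rate, each followed by a stack of dilated residual convolutions. The channel widths and upsampling factors below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Dilated residual convolution block."""
    def __init__(self, ch: int, dilation: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=3, dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, ch, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class TinyMelGANGenerator(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        layers = [nn.Conv1d(n_mels, 256, kernel_size=7, padding=3)]
        ch = 256
        for factor in (8, 8, 4):  # 8 * 8 * 4 = 256x upsampling (the hop length)
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * factor,
                                   stride=factor, padding=factor // 2),
            ]
            ch //= 2
            layers += [ResBlock(ch, d) for d in (1, 3, 9)]
        layers += [nn.LeakyReLU(0.2),
                   nn.Conv1d(ch, 1, kernel_size=7, padding=3),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):   # (B, n_mels, T)
        return self.net(mel)  # (B, 1, 256 * T)

mel = torch.randn(1, 80, 50)
print(TinyMelGANGenerator()(mel).shape)  # torch.Size([1, 1, 12800])
```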

Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

This work introduces a deep learning-based approach to do voice conversion with speech style transfer across different speakers using a combination of Variational Auto-Encoder and Generative Adversarial Network as the main components followed by a WaveNet-based vocoder.

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
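A hedged sketch of that two-stage pipeline using torchaudio's pretrained bundle (assuming a recent torchaudio with the pipelines API). Note that this bundle pairs Tacotron 2 with a WaveRNN vocoder rather than the modified WaveNet described in the paper.

```python
# Two-stage TTS: text -> mel-spectrogram (Tacotron 2) -> waveform (vocoder).
import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()  # text -> phoneme token IDs
tacotron2 = bundle.get_tacotron2()       # tokens -> mel-spectrogram
vocoder = bundle.get_vocoder()           # mel-spectrogram -> waveform

tokens, lengths = processor("Hello, world!")
with torch.inference_mode():
    mel, mel_lengths, _ = tacotron2.infer(tokens, lengths)
    waveform, wav_lengths = vocoder(mel, mel_lengths)
print(waveform.shape)
```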

Neural Speech Synthesis with Transformer Network

This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron 2, achieving state-of-the-art performance and quality close to human speech.
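The core substitution can be sketched with PyTorch's built-in attention module: each decoder frame attends to all encoder positions in parallel, removing the sequential recurrence of an RNN attention decoder. Dimensions below are illustrative; the actual model stacks full Transformer encoder/decoder blocks with positional encodings.

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text_enc = torch.randn(1, 42, d_model)      # encoder output (phoneme sequence)
mel_queries = torch.randn(1, 100, d_model)  # decoder states (mel frames)

# Cross-attention: every mel frame looks at every text position at once.
out, weights = attn(mel_queries, text_enc, text_enc)
print(out.shape, weights.shape)  # (1, 100, 256) (1, 100, 42)
```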

LPCNet: Improving Neural Speech Synthesis through Linear Prediction

J.-M. Valin and J. Skoglund, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
It is demonstrated that LPCNet can achieve significantly higher quality than WaveRNN for the same network size, and that high-quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS, which makes it easier to deploy neural synthesis applications on lower-power devices such as embedded systems and mobile phones.
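The classical core is easy to demonstrate: an LPC filter predicts each sample from the preceding ones, leaving only a small residual (the excitation) for the neural model. Below is a sketch with librosa and scipy; LPCNet itself derives the LPC coefficients from cepstral features and models the excitation with a WaveRNN variant.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

# One second of an example clip that ships with librosa.
y, sr = librosa.load(librosa.example("trumpet"), sr=16000, duration=1.0)

a = librosa.lpc(y, order=16)     # LPC coefficients, a[0] == 1
residual = lfilter(a, [1.0], y)  # e[n] = y[n] + sum_k a[k] * y[n-k]
y_pred = y - residual            # the linearly predictable part of each sample

# Most of the signal energy is predictable; the residual is what remains
# for the neural network to model.
print(np.var(y), np.var(residual))
```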

Tacotron: Towards End-to-End Speech Synthesis

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.
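The recipe scaled up in that paper can be sketched in a few lines: treat a log-mel spectrogram as a one-channel image and classify it with a standard image-style CNN. The toy network and class count below are illustrative; the paper uses full architectures (AlexNet, VGG, Inception, ResNet) and a 30,871-label vocabulary.

```python
import torch
import torch.nn as nn

n_classes = 10  # illustrative; the paper's label vocabulary is far larger

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, n_classes),
)

# A batch of log-mel "images": (batch, 1 channel, mel bands, time frames).
logmel = torch.randn(8, 1, 64, 96)
print(model(logmel).shape)  # torch.Size([8, 10])
```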

StarGAN-VC: Non-Parallel Many-to-Many Voice Conversion Using Star Generative Adversarial Networks

Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.

Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

This paper improves the original MelGAN by increasing the receptive field of the generator and substituting the feature matching loss with a multi-resolution STFT loss to better measure the difference between fake and real speech.
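A hedged sketch of a multi-resolution STFT loss of this kind: at several FFT resolutions, penalize both spectral convergence and log-magnitude differences between real and generated audio. The resolutions and equal weighting below are common choices, not necessarily the paper's exact setup.

```python
import torch

def stft_mag(x: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return spec.abs().clamp(min=1e-7)  # clamp keeps log() finite

def multi_res_stft_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    resolutions = ((512, 128), (1024, 256), (2048, 512))
    loss = torch.tensor(0.0)
    for n_fft, hop in resolutions:
        f, r = stft_mag(fake, n_fft, hop), stft_mag(real, n_fft, hop)
        sc = torch.norm(r - f) / torch.norm(r)                    # spectral convergence
        mag = torch.mean(torch.abs(torch.log(r) - torch.log(f)))  # log-magnitude L1
        loss = loss + sc + mag
    return loss / len(resolutions)

fake, real = torch.randn(2, 16384), torch.randn(2, 16384)
print(multi_res_stft_loss(fake, real))
```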