DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding

Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li. "DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding." 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
Conventional vocoders are commonly used as analysis tools to provide interpretable features for downstream tasks such as speech synthesis and voice conversion. They are built on signal-processing principles under certain assumptions about the signal, and therefore do not generalize easily across audio types, for example, from speech to singing. In this paper, we propose a deep neural analyzer, denoted as DeepA – a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the…
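The analysis stage that DeepA learns can be contrasted with the conventional signal-processing route it aims to replace. As an illustration only (this is a textbook autocorrelation F0 estimator, not the paper's neural analyzer), a minimal sketch in Python:

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, f0_min=60.0, f0_max=500.0):
    """Estimate the F0 of one frame via autocorrelation.

    A conventional signal-processing baseline: production vocoders use
    more refined estimators, and DeepA replaces this stage with a
    learned neural analyzer. Shown only to illustrate the kind of
    analysis output (F0) the paper's pipeline starts from.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / f0_max)          # shortest plausible period
    lag_max = int(sr / f0_min)          # longest plausible period
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / lag

# A 200 Hz sinusoid should yield an estimate near 200 Hz.
sr = 16000
t = np.arange(int(0.04 * sr)) / sr      # one 40 ms frame
f0 = estimate_f0_autocorr(np.sin(2 * np.pi * 200 * t), sr)
print(f0)
```

Real speech and singing need per-frame voicing decisions and octave-error handling on top of this, which is part of why a learned analyzer is attractive.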

Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling

Experiments on a Chinese singing voice corpus demonstrate that the method using deep autoregressive networks (DARs) can effectively produce F0 contours with vibrato, and achieves better objective and subjective performance than the conventional method using recurrent neural networks (RNNs).

Deep Voice: Real-time Neural Text-to-Speech

Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. It shows that inference with the system can run faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.

Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion

This work reconsiders the relationship among the vocoder features extracted by the high-quality vocoders adopted in conventional VC systems, hypothesizes that the spectral features are in fact F0-dependent, and proposes to use F0 as an additional input to the decoder.

Deep neural network based voice conversion with a large synthesized parallel corpus

A voice conversion framework based on deep neural networks (DNNs) maps the speech features of a source speaker to those of a target speaker; it still achieves lower log-spectral distortion than the conventional Gaussian mixture model (GMM) approach.

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

This article provides a comprehensive overview of the state-of-the-art of voice conversion techniques and their performance evaluation methods from the statistical approaches to deep learning, and discusses their promise and limitations.

DDSP: Differentiable Digital Signal Processing

The Differentiable Digital Signal Processing library is introduced, which enables direct integration of classic signal processing elements with deep learning methods and achieves high-fidelity generation without the need for large autoregressive models or adversarial losses.

Neural Homomorphic Vocoder

The neural homomorphic vocoder (NHV) is a source-filter-based neural vocoder framework that synthesizes speech by filtering impulse trains and noise with linear time-varying filters; it is highly efficient, fully controllable, and interpretable.
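The source-filter idea behind NHV can be sketched in a few lines: a periodic impulse train and white noise are each filtered frame by frame and summed. In NHV the time-varying filters are predicted by a neural network from acoustic features; the filter values below are placeholders, so this is a toy sketch of the synthesis structure, not the paper's model:

```python
import numpy as np

def source_filter_synth(f0_hz, harm_firs, noise_firs, frame_len, sr):
    """Toy source-filter synthesis in the spirit of NHV.

    Per frame: filter an impulse train (periodic source) with a
    frame-specific FIR, filter white noise (aperiodic source) with
    another FIR, and add the two. NHV predicts its linear time-varying
    filters with a neural network; these FIRs are placeholders.
    """
    n_frames = len(f0_hz)
    out = np.zeros(n_frames * frame_len)
    phase = 0.0
    for i in range(n_frames):
        frame = np.zeros(frame_len)
        period = sr / max(f0_hz[i], 1e-3)
        while phase < frame_len:        # one impulse per pitch period
            frame[int(phase)] = 1.0
            phase += period
        phase -= frame_len              # carry phase into the next frame
        noise = np.random.randn(frame_len) * 0.01
        seg = (np.convolve(frame, harm_firs[i])[:frame_len]
               + np.convolve(noise, noise_firs[i])[:frame_len])
        out[i * frame_len:(i + 1) * frame_len] = seg
    return out

sr, frame_len, n = 16000, 160, 20
f0 = np.full(n, 100.0)                   # flat 100 Hz contour
harm = np.tile(np.hanning(32), (n, 1))   # placeholder "vocal tract" FIRs
noi = np.tile(np.ones(8) / 8, (n, 1))
y = source_filter_synth(f0, harm, noi, frame_len, sr)
print(y.shape)
```

Because the sources (impulse train, noise) and the filters are explicit, F0 and timbre stay directly controllable, which is the interpretability the summary above refers to.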

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network

VocGAN is nearly as fast as MelGAN but significantly improves the quality and consistency of the output waveform on multiple evaluation metrics, including mean opinion score (MOS), with minimal additional overhead.

LPCNET: Improving Neural Speech Synthesis through Linear Prediction

  • J. Valin, J. Skoglund
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that LPCNet can achieve significantly higher quality than WaveRNN for the same network size, and that high-quality LPCNet speech synthesis is achievable at a complexity under 3 GFLOPS, which makes it easier to deploy neural synthesis applications on lower-power devices such as embedded systems and mobile phones.
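The linear-prediction step that gives LPCNet its efficiency is classic DSP: each sample is predicted as a weighted sum of the previous ones, and only the small residual is left for the neural network to model. A generic autocorrelation-method LPC sketch in Python (not LPCNet's implementation, which uses Levinson-Durbin recursion and fixed-point arithmetic):

```python
import numpy as np

def lpc_coeffs(x, order):
    """Linear-prediction coefficients via the autocorrelation method.

    LPCNet pairs this classic step with a small RNN that models the
    prediction residual; this sketch covers only the linear part and
    solves the normal equations directly instead of via Levinson-Durbin.
    """
    ac = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Toeplitz normal equations R a = r for the predictor a
    R = np.array([[ac[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, ac[1:order + 1])

def lp_predict(x, a):
    """Predict each sample from the previous `order` samples."""
    order = len(a)
    pred = np.zeros_like(x)
    for n in range(order, len(x)):
        pred[n] = a @ x[n - order:n][::-1]
    return pred

sr = 8000
t = np.arange(800) / sr
x = np.sin(2 * np.pi * 250 * t)   # a pure tone is almost perfectly predictable
a = lpc_coeffs(x, order=2)
res = x - lp_predict(x, a)
print(np.abs(res[2:]).max())      # small relative to the unit-amplitude signal
```

Because the predictor removes most of the signal's short-term structure up front, the network only has to synthesize the low-energy residual, which is where the sub-3-GFLOPS complexity comes from.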