DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding
@article{Nikonorov2021DEEPAAD,
  title   = {DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding},
  author  = {Sergey Nikonorov and Berrak Sisman and Mingyang Zhang and Haizhou Li},
  journal = {2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year    = {2021},
  pages   = {618-625}
}
Conventional vocoders are commonly used as analysis tools to provide interpretable features for downstream tasks such as speech synthesis and voice conversion. They are built on certain assumptions about the signal, following signal processing principles, and are therefore not easily generalizable to different types of audio, for example, from speech to singing. In this paper, we propose a deep neural analyzer, denoted as DeepA – a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the…
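To make the analysis side of such a vocoder concrete, the following is a minimal, hypothetical Python sketch of the kind of interface an analyzer exposes: frame-level F0 plus timbre and aperiodicity encodings. The class and function names, the toy autocorrelation F0 estimator, and the placeholder encodings are illustrative assumptions, not the DeepA implementation, which learns its encoders from data.

```python
# Minimal, hypothetical sketch of a vocoder-style analysis interface.
# Names (VocoderFeatures, analyze, estimate_f0_autocorr) are illustrative only;
# this is NOT the DeepA implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class VocoderFeatures:
    f0: np.ndarray            # frame-level fundamental frequency in Hz (0 = unvoiced)
    timbre: np.ndarray        # frame-level timbre/spectral encoding
    aperiodicity: np.ndarray  # frame-level aperiodicity/noise encoding

def estimate_f0_autocorr(frame: np.ndarray, sr: int,
                         fmin: float = 60.0, fmax: float = 500.0) -> float:
    """Toy autocorrelation F0 estimator for a single windowed frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi]))
    # Call the frame unvoiced if the autocorrelation peak is weak.
    return sr / lag if ac[lag] / ac[0] > 0.3 else 0.0

def analyze(wav: np.ndarray, sr: int, frame_len: int = 1024,
            hop: int = 256) -> VocoderFeatures:
    """Frame the signal and fill the three feature streams (toy placeholders)."""
    n_frames = max(0, 1 + (len(wav) - frame_len) // hop)
    f0 = np.zeros(n_frames)
    spec = np.zeros((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = wav[i * hop: i * hop + frame_len] * np.hanning(frame_len)
        f0[i] = estimate_f0_autocorr(frame, sr)
        spec[i] = np.abs(np.fft.rfft(frame))
    # A real analyzer would separate periodic and aperiodic energy; here the
    # magnitude spectrum and a constant stand in for both encodings.
    return VocoderFeatures(f0=f0, timbre=spec, aperiodicity=np.ones_like(spec))
```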
References
Showing 1-10 of 37 references
Singing Voice Synthesis Using Deep Autoregressive Neural Networks for Acoustic Modeling
- Computer Science, INTERSPEECH
- 2019
Experiments on a Chinese singing voice corpus demonstrate that the method using deep autoregressive (DAR) models can produce F0 contours with vibrato effectively, and achieves better objective and subjective performance than a conventional method using recurrent neural networks (RNNs).
Deep Voice: Real-time Neural Text-to-Speech
- Computer Science, ICML
- 2017
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels for both CPU and GPU that achieve up to 400x speedups over existing implementations.
Investigation of F0 conditioning and Fully Convolutional Networks in Variational Autoencoder based Voice Conversion
- Computer Science, INTERSPEECH
- 2019
This work reconsiders the relationship among the vocoder features extracted by the high-quality vocoders adopted in conventional VC systems, hypothesizes that the spectral features are in fact F0 dependent, and proposes to use F0 as an additional input to the decoder.
Deep neural network based voice conversion with a large synthesized parallel corpus
- Computer Science, 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
- 2016
A voice conversion framework based on deep neural networks (DNNs) maps the speech features of a source speaker to those of a target speaker, and a lower log-spectral distortion can still be observed compared with the conventional Gaussian mixture model (GMM) approach.
An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2021
This article provides a comprehensive overview of state-of-the-art voice conversion techniques and their performance evaluation methods, from statistical approaches to deep learning, and discusses their promise and limitations.
DDSP: Differentiable Digital Signal Processing
- Computer Science, ICLR
- 2020
The Differentiable Digital Signal Processing (DDSP) library is introduced, enabling direct integration of classic signal processing elements with deep learning methods and achieving high-fidelity generation without the need for large autoregressive models or adversarial losses.
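As a rough illustration of the "differentiable DSP" idea, the sketch below is an assumed toy example in PyTorch, not the DDSP library's API: a harmonic oscillator bank whose per-harmonic amplitudes receive gradients from a loss on the synthesized waveform, so a network upstream could learn them.

```python
# Toy differentiable harmonic oscillator (assumed example, not the DDSP API):
# gradients flow from the output waveform back to the learnable amplitudes.
import math
import torch

def harmonic_synth(f0_hz: torch.Tensor, amps: torch.Tensor, sr: int) -> torch.Tensor:
    """f0_hz: (T,) per-sample F0; amps: (H,) harmonic amplitudes -> (T,) waveform."""
    n_harm = amps.shape[0]
    harm = torch.arange(1, n_harm + 1, dtype=f0_hz.dtype)        # (H,)
    phase = 2 * math.pi * torch.cumsum(f0_hz, dim=0) / sr        # (T,)
    sines = torch.sin(phase[:, None] * harm[None, :])            # (T, H)
    # Silence harmonics above Nyquist to avoid aliasing.
    mask = (f0_hz[:, None] * harm[None, :]) < (sr / 2)
    return (sines * mask * amps[None, :]).sum(dim=-1)

sr = 16000
f0 = torch.full((sr,), 220.0)                 # one second at 220 Hz
amps = torch.rand(8, requires_grad=True)      # learnable harmonic amplitudes
wav = harmonic_synth(f0, amps, sr)
loss = (wav ** 2).mean()                      # placeholder loss
loss.backward()                               # gradients reach `amps`
print(amps.grad.shape)                        # torch.Size([8])
```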
Neural Homomorphic Vocoder
- Computer Science, INTERSPEECH
- 2020
The neural homomorphic vocoder (NHV) is a source-filter-based neural vocoder framework that synthesizes speech by filtering impulse trains and noise with linear time-varying filters, and is highly efficient, fully controllable, and interpretable.
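To picture the source-filter formulation, here is a simplified NumPy sketch (illustration only, not NHV itself, whose time-varying filters are predicted by a neural network from acoustic features): an impulse train and white noise are filtered frame by frame with per-frame FIR filters and summed.

```python
# Simplified source-filter sketch (illustration only, not NHV): an impulse
# train (periodic source) and white noise (aperiodic source) are filtered
# with per-frame FIR filters and overlap-added into a waveform.
import numpy as np

def impulse_train(f0_hz: float, n_samples: int, sr: int) -> np.ndarray:
    """Unit impulses spaced one pitch period apart."""
    out = np.zeros(n_samples)
    period = int(round(sr / f0_hz))
    out[::period] = 1.0
    return out

def ltv_filter(source: np.ndarray, frame_filters: np.ndarray, hop: int) -> np.ndarray:
    """Apply a different FIR filter to each hop-sized segment and overlap-add."""
    n_frames, taps = frame_filters.shape
    out = np.zeros(len(source) + taps - 1)
    for i in range(n_frames):
        seg = source[i * hop: (i + 1) * hop]
        out[i * hop: i * hop + len(seg) + taps - 1] += np.convolve(seg, frame_filters[i])
    return out[: len(source)]

sr, hop, taps, n_frames = 16000, 256, 64, 40
n = n_frames * hop
rng = np.random.default_rng(0)

# In NHV these filters come from a neural network; here they are random FIRs.
harmonic_filters = rng.normal(scale=0.1, size=(n_frames, taps))
noise_filters = rng.normal(scale=0.01, size=(n_frames, taps))

speech = (ltv_filter(impulse_train(150.0, n, sr), harmonic_filters, hop)
          + ltv_filter(rng.normal(size=n), noise_filters, hop))
print(speech.shape)  # (10240,)
```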
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
- Computer Science, NeurIPS
- 2019
The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the paper also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network
- Computer Science, INTERSPEECH
- 2020
VocGAN is nearly as fast as MelGAN but significantly improves the quality and consistency of the output waveform, showing improved results on multiple evaluation metrics, including mean opinion score (MOS), with minimal additional overhead.
LPCNET: Improving Neural Speech Synthesis through Linear Prediction
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
It is demonstrated that LPCNet can achieve significantly higher quality than WaveRNN for the same network size, and that high-quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS, making it easier to deploy neural synthesis applications on lower-power devices such as embedded systems and mobile phones.
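The linear-prediction component that LPCNet builds on can be sketched in a few lines: Levinson-Durbin recursion turns frame autocorrelations into coefficients that predict each sample from its recent past, leaving a low-energy residual for the neural network to model. The code below is a generic LPC illustration (an assumed example, not the LPCNet codebase).

```python
# Generic LPC illustration (not the LPCNet codebase): estimate prediction
# coefficients with Levinson-Durbin, then compare signal vs. residual energy.
import numpy as np

def levinson_durbin(r: np.ndarray, order: int) -> np.ndarray:
    """Solve the Toeplitz normal equations; returns [1, a1, ..., a_order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]   # update previous coefficients
        a[i] = k                             # new reflection coefficient
        err *= (1.0 - k * k)                 # remaining prediction error
    return a

order, sr = 16, 16000
rng = np.random.default_rng(0)

# Synthetic "voiced" frame: a decaying 200 Hz tone plus a little noise.
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 200 * t) * np.exp(-20 * t) + 0.01 * rng.normal(size=t.size)
frame = frame * np.hanning(frame.size)

# Autocorrelation up to the LPC order.
full = np.correlate(frame, frame, mode="full")
r = full[frame.size - 1: frame.size + order]

a = levinson_durbin(r, order)
residual = np.convolve(frame, a)[: frame.size]   # prediction-error signal
print("signal energy  :", float(np.sum(frame ** 2)))
print("residual energy:", float(np.sum(residual ** 2)))  # typically much smaller
```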