GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis

@article{Juvela2019GlotNetARW,
  title={GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis},
  author={Lauri Juvela and Bajibabu Bollepalli and Vassilis Tsiaras and Paavo Alku},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2019},
  volume={27},
  pages={1019--1030}
}
Recently, generative neural network models that operate directly on raw audio, such as WaveNet, have improved the state of the art in text-to-speech (TTS) synthesis. Moreover, there is increasing interest in using these models as statistical vocoders for generating speech waveforms from various acoustic features. However, there is also a need to reduce the model complexity without compromising the synthesis quality. Previously, glottal pulseforms (i.e., time-domain waveforms corresponding to…
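The abstract rests on the classical source-filter idea: a glottal excitation signal drives an all-pole vocal tract filter obtained by linear prediction. A minimal NumPy/SciPy sketch of that idea (a toy illustration, not the paper's actual GlotNet model; the filter coefficients, sample rate, and F0 below are made-up values):

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sample rate (Hz), illustrative
f0 = 120.0                      # fundamental frequency (Hz), illustrative
n = int(0.1 * fs)               # 100 ms of signal

# Stand-in excitation: an impulse train at the glottal pulse rate.
# In GlotNet, a WaveNet generates this excitation waveform instead.
excitation = np.zeros(n)
excitation[::int(fs / f0)] = 1.0

# Illustrative all-pole vocal tract filter A(z); in practice the
# coefficients come from LP analysis of speech frames.
a = np.array([1.0, -1.3, 0.49])

# Synthesis filtering: speech = excitation filtered by 1 / A(z).
speech = lfilter([1.0], a, excitation)
print(speech.shape)   # (1600,)
```

The point of modeling the excitation rather than the speech waveform is that the excitation has a flatter spectral envelope, so a smaller raw-waveform model can capture it.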

Citations

SFNet: A Computationally Efficient Source Filter Model Based Neural Speech Synthesis
  • A. Mv, P. Ghosh
  • IEEE Signal Processing Letters, 2020
There is a significant reduction in memory and computational complexity compared to the state-of-the-art speaker-independent neural speech synthesizer, without any loss in the naturalness of the synthesized speech.
ExcitGlow: Improving a WaveGlow-based Neural Vocoder with Linear Prediction Analysis
This paper proposes ExcitGlow, a vocoder that incorporates the source-filter model of voice production theory into a flow-based deep generative model, choosing a negative log-likelihood (NLL) loss for the excitation signal and a multi-resolution spectral distance for the speech signal.
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
This paper proposes an alternative training strategy for a parallel neural vocoder using generative adversarial networks, integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves a significant improvement in inference speed while outperforming a WaveNet in copy-synthesis quality.
A Neural Vocoder With Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis
  • Yang Ai, Zhenhua Ling
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020
This article presents a neural vocoder named HiNet, which reconstructs speech waveforms from acoustic features by predicting amplitude and phase spectra hierarchically, and achieves better naturalness of reconstructed speech than the conventional STRAIGHT vocoder, a 16-bit WaveNet vocoder using an open-source implementation, and an NSF vocoder with complexity similar to that of the PSP.
Towards Universal Neural Vocoding with a Multi-band Excited WaveNet
This paper introduces the Multi-Band Excited WaveNet, a neural vocoder for speaking and singing voices consisting of multiple specialized DNNs combined with dedicated signal-processing components, and demonstrates remaining limits to the universality of neural vocoders, e.g., the creation of saturated singing voices.
A Survey on Neural Speech Synthesis
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, with a focus on the key components of neural TTS: text analysis, acoustic models, and vocoders.
Improving LPCNet-Based Text-to-Speech with Linear Prediction-Structured Mixture Density Network
An improved LPCNet vocoder using a linear prediction (LP)-structured mixture density network (MDN) is proposed; the LP-MDN enables the autoregressive neural vocoder to structurally represent the interactions between the vocal-tract and vocal-source components.
Using Cyclic Noise as the Source Signal for Neural Source-Filter-Based Speech Waveform Model
A more flexible source signal called cyclic noise is proposed: a quasi-periodic noise sequence given by the convolution of a pulse train with a static random noise whose trainable decay rate controls the signal shape.
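The cyclic-noise construction described above is concrete enough to sketch: convolve a pulse train with an exponentially decaying noise segment. This is an illustration under assumed parameter values (the decay rate is fixed here, whereas the paper makes it trainable), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
fs, f0, dur = 16000, 100.0, 0.05   # illustrative sample rate, F0, duration
n = int(dur * fs)

# Pulse train at the fundamental period.
pulse_train = np.zeros(n)
pulse_train[::int(fs / f0)] = 1.0

# Static random noise shaped by an exponential decay; the decay rate
# controls the signal shape (learned in the paper, fixed here).
decay_rate = 200.0
L = int(fs / f0)                    # noise segment spans one period
t = np.arange(L) / fs
noise = rng.standard_normal(L) * np.exp(-decay_rate * t)

# Quasi-periodic source signal: pulse train convolved with the noise.
cyclic_noise = np.convolve(pulse_train, noise)[:n]
print(cyclic_noise.shape)   # (800,)
```

Because the decaying noise segment fits inside one pitch period here, consecutive periods of the result are identical copies of the noise, which is what makes the source quasi-periodic rather than purely stochastic.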
Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations
A neural analysis and synthesis framework that can manipulate the voice, pitch, and speed of an arbitrary speech signal, using a novel training strategy based on information perturbation that allows fully self-supervised training.
On Adaptive LASSO-based Sparse Time-Varying Complex AR Speech Analysis
  • K. Funaki
  • 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2021
Adaptive LASSO-based TV-CAR analysis is proposed, and its performance is evaluated on F0 estimation, showing that the resulting LP residual makes it possible to estimate F0 more precisely.
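The summary above relies on the standard LP-residual idea: inverse-filtering speech with its LP polynomial A(z) flattens the spectral envelope, leaving a residual whose periodicity exposes F0. A minimal sketch with synthetic data (illustrative only, not the paper's TV-CAR method; filter coefficients and F0 are assumed values):

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, n = 16000, 125.0, 1600    # illustrative sample rate, F0, length

# Synthetic "speech": impulse train through a known all-pole filter 1/A(z).
a = np.array([1.0, -1.3, 0.49])
x = np.zeros(n)
x[::int(fs / f0)] = 1.0
speech = lfilter([1.0], a, x)

# Inverse filtering with A(z) recovers the excitation: the LP residual.
residual = lfilter(a, [1.0], speech)

# F0 from the residual's strongest autocorrelation peak in a 60-400 Hz range.
ac = np.correlate(residual, residual, mode="full")[n - 1:]
lo, hi = int(fs / 400), int(fs / 60)
lag = np.argmax(ac[lo:hi]) + lo
print(fs / lag)   # 125.0
```

Searching the autocorrelation of the residual rather than of the speech itself avoids formant-induced peaks, which is why a cleaner residual (as the summary claims) yields a more precise F0 estimate.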
