Speaker-independent raw waveform model for glottal excitation

@inproceedings{Juvela2018SpeakerindependentRW,
  title={Speaker-independent raw waveform model for glottal excitation},
  author={Lauri Juvela and Vassilis Tsiaras and Bajibabu Bollepalli and Manu Airaksinen and Junichi Yamagishi and Paavo Alku},
  booktitle={INTERSPEECH},
  year={2018}
}
Recent speech technology research has seen a growing interest in using WaveNets as statistical vocoders, i.e., generating speech waveforms from acoustic features. These models have been shown to improve the generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets with acoustic features allows sharing the waveform generator model across multiple speakers without additional speaker codes. However… 

A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis
TLDR
This study presents a raw waveform glottal excitation model, called GlotNet, and compares its performance with the corresponding direct speech waveform model, WaveNet, using equivalent architectures.
GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis
TLDR
This study presents a raw waveform glottal excitation model, called GlotNet, and compares its performance with the corresponding direct speech waveform model, WaveNet, using equivalent architectures.
Speaker-Adaptive Neural Vocoders for Parametric Speech Synthesis Systems
TLDR
Experimental results verify that the proposed TTS systems with speaker-adaptive neural vocoders outperform those with traditional source-filter model-based vocoders and those with WaveNet vocoders, trained either speaker-dependently or speaker-independently.
ExcitNet Vocoder: A Neural Excitation Model for Parametric Speech Synthesis Systems
TLDR
Experimental results show that the proposed ExcitNet vocoder, trained both speaker-dependently and speaker-independently, outperforms traditional linear prediction vocoders and similarly configured conventional WaveNet vocoders.
Neural Text-to-Speech with a Modeling-by-Generation Excitation Vocoder
TLDR
A modeling-by-generation (MbG) excitation vocoder for a neural text-to-speech (TTS) system that provides high-quality synthetic speech by achieving a mean opinion score of 4.57 within the TTS framework is proposed.
Speaker-adaptive neural vocoders for statistical parametric speech synthesis systems
TLDR
Experimental results verify that the proposed SPSS systems with speaker-adaptive neural vocoders outperform those with traditional source-filter model-based vocoders and those with WaveNet vocoders, trained either speaker-dependently or speaker-independently.
Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis
TLDR
It was demonstrated that the NSF models generated waveforms at least 100 times faster than the authors' WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.
Online Speaker Adaptation for WaveNet-based Neural Vocoders
  • Qiuchen Huang, Yang Ai, Zhenhua Ling
  • Physics, Computer Science
    2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2020
TLDR
Experimental results demonstrate that the proposed online speaker adaptation method can achieve a better objective and subjective performance on reconstructing waveforms of unseen speakers than the conventional speaker-independent WaveNet vocoder.
LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis
TLDR
An LP-WaveNet vocoder, where the complicated interactions between vocal source and vocal tract components are jointly trained within a mixture density network-based WaveNet model, which outperforms the conventional WaveNet vocoders both objectively and subjectively.
Non-Parallel Voice Conversion System With WaveNet Vocoder and Collapsed Speech Suppression
TLDR
This paper integrates a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique and proposes a collapsed speech segment detector (CSSD) to mitigate the negative effects introduced by the LPCDC.

References

An investigation of multi-speaker training for wavenet vocoder
TLDR
The experimental results demonstrate that 1) the multi-speaker WaveNet vocoder still outperforms STRAIGHT in generating known speakers' voices but is comparable to STRAIGHT in generating unknown speakers' voices, and 2) multi-speaker training is effective for developing a WaveNet vocoder capable of speech modification.
Speaker-Dependent WaveNet Vocoder
TLDR
A speaker-dependent WaveNet vocoder is proposed, a method of synthesizing speech waveforms with WaveNet, by utilizing acoustic features from existing vocoder as auxiliary features of WaveNet.
A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis
TLDR
This paper builds a framework in which new vocoding and acoustic modeling techniques with conventional approaches are compared by means of a large scale crowdsourced evaluation, and shows that generative adversarial networks and an autoregressive (AR) model performed better than a normal recurrent network and the AR model performed best.
High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network
TLDR
Subjective listening tests conducted on a US English female voice show that the proposed QCP-DNN method gives significant improvement in synthetic naturalness compared to the two previously developed glottal vocoders.
HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering
TLDR
A hidden Markov model (HMM)-based speech synthesizer that utilizes glottal inverse filtering for generating natural-sounding synthetic speech; the quality is clearly better than that of two HMM-based speech synthesis systems based on widely used vocoder techniques.
Generative Adversarial Network-Based Glottal Waveform Model for Statistical Parametric Speech Synthesis
TLDR
A new method for predicting glottal waveforms by generative adversarial networks (GANs) is proposed, and the newly proposed GANs achieve synthesis quality comparable to that of widely-used DNNs, without using an additive noise component.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps
Wavenet Based Low Rate Speech Coding
TLDR
This work describes how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s and shows that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener.
On the Use of WaveNet as a Statistical Vocoder
TLDR
This paper used two female and two male speakers from the CMU-ARCTIC database to contrast the use of cepstrum coefficients and filter-bank features as local conditioners with the goal to improve the overall quality for both male and female speakers.