A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis

@article{Wang2018ACO,
  title={A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis},
  author={Xin Wang and Jaime Lorenzo-Trueba and Shinji Takaki and Lauri Juvela and Junichi Yamagishi},
  journal={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={4804-4808}
}
Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum-phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches by means of a large-scale crowdsourced evaluation. Results on acoustic models showed that…
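The "minimum-phase approximation" mentioned in the abstract is the step in conventional vocoding where phase is derived deterministically from the amplitude spectrum, so any phase detail in the original signal is discarded. As a hedged illustration (not code from the paper), the NumPy sketch below reconstructs a minimum-phase complex spectrum from a magnitude spectrum via the real cepstrum; the function name and parameter values are illustrative.

import numpy as np

def minimum_phase_spectrum(magnitude, n_fft=1024):
    # magnitude: full-length (n_fft,) magnitude spectrum.
    # Real cepstrum of the log-magnitude spectrum.
    cepstrum = np.fft.ifft(np.log(np.maximum(magnitude, 1e-10))).real
    # Fold the anti-causal part onto the causal part; this makes the
    # phase the Hilbert transform of the log magnitude (minimum phase).
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0
    # Exponentiate back to a complex spectrum whose phase is now
    # fully determined by the magnitude.
    return np.exp(np.fft.fft(fold * cepstrum))

Because the phase is completely determined by the magnitude, any true phase structure is lost, which is exactly the lossiness that the neural waveform generators compared in this paper aim to avoid.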

GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
TLDR
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves a significant improvement in inference speed while outperforming a WaveNet in copy-synthesis quality.
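GELP's structural choice is to keep classical linear-predictive synthesis: the network only models the excitation, while the spectral envelope is imposed by an all-pole filter. Below is a minimal sketch of that analysis/synthesis split, assuming librosa and SciPy; the toy signal and variable names are illustrative, not taken from the GELP implementation.

import numpy as np
import librosa
import scipy.signal

# Toy AR "speech" frame so the example is self-contained.
rng = np.random.default_rng(0)
frame = scipy.signal.lfilter([1.0], [1.0, -0.9], rng.standard_normal(1024))

# Estimate an all-pole vocal tract model A(z) from the frame.
a = librosa.lpc(frame, order=24)                    # a[0] == 1

# Residual, i.e. the excitation a neural model such as GELP's GAN
# would generate; obtained here by inverse filtering with A(z).
excitation = scipy.signal.lfilter(a, [1.0], frame)

# Synthesis: passing the excitation through 1/A(z) recovers the frame.
resynth = scipy.signal.lfilter([1.0], a, excitation)
assert np.allclose(resynth, frame, atol=1e-6)

In GELP itself the excitation comes from a GAN conditioned on the mel-spectrogram rather than from inverse filtering; the sketch only shows why delegating the spectral envelope to the synthesis filter simplifies what the neural model has to learn.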
A Comparison of Recent Neural Vocoders for Speech Signal Reconstruction
TLDR
A fair comparison of recent neural vocoders is presented in a signal reconstruction scenario, using each vocoder to resynthesize speech waveforms from mel-scaled spectrograms, a compact and generally non-invertible representation of the underlying audio signal.
ExcitNet Vocoder: A Neural Excitation Model for Parametric Speech Synthesis Systems
TLDR
Experimental results show that the proposed ExcitNet vocoder, trained both speaker-dependently and speaker-independently, outperforms traditional linear prediction vocoders and similarly configured conventional WaveNet vocoders.
Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
TLDR
Experimental results show that acoustic models trained in the WGAN-GP framework with a back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
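For readers unfamiliar with WGAN-GP, the gradient penalty that distinguishes it from a vanilla GAN fits in a few lines. A hedged PyTorch sketch, assuming a generic critic network and batched acoustic-feature tensors; shapes and names are illustrative.

import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # real, fake: (batch, frames, feature_dim) acoustic features.
    # WGAN-GP: push the critic's gradient norm toward 1 at points
    # interpolated between real and generated features.
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    inter = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    score = critic(inter)
    grads, = torch.autograd.grad(outputs=score.sum(), inputs=inter,
                                 create_graph=True)
    norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((norm - 1.0) ** 2).mean()

The critic loss is then fake_score.mean() - real_score.mean() plus this penalty, which keeps the critic approximately 1-Lipschitz without weight clipping.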
Sequence-to-Sequence Acoustic Modeling for Voice Conversion
TLDR
Experimental results show that the proposed sequence-to-sequence ConvErsion NeTwork (SCENT) obtains better objective and subjective performance than baseline methods that use Gaussian mixture models and deep neural networks as acoustic models.
Neural Source-filter-based Waveform Model for Statistical Parametric Speech Synthesis
TLDR
This study proposes a non-AR neural source-filter waveform model that can be trained directly using spectrum-based training criteria and the stochastic gradient descent method; the quality of its synthetic speech is close to that of speech generated by the AR WaveNet.
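A "spectrum-based training criterion" compares waveforms through their short-time spectral amplitudes rather than sample by sample, which is what allows non-autoregressive training with plain stochastic gradient descent. A rough PyTorch sketch of one such criterion follows; the exact distances and FFT configurations used in the paper may differ.

import torch

def spectral_amplitude_loss(pred, target, fft_sizes=(512, 1024, 2048)):
    # pred, target: (batch, samples) waveforms.
    # L1 distance between log STFT amplitudes, averaged over
    # several analysis resolutions.
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        def amp(x):
            return torch.stft(x, n_fft, hop_length=n_fft // 4,
                              window=window,
                              return_complex=True).abs().clamp(min=1e-7)
        loss = loss + (amp(pred).log() - amp(target).log()).abs().mean()
    return loss / len(fft_sizes)

Because this loss is differentiable with respect to the predicted waveform, it can be minimized directly with SGD, with no teacher forcing or probability-density distillation.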
Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks
TLDR
Listening test results show that while direct waveform generation with GANs is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.
Effective parameter estimation methods for an ExcitNet model in generative text-to-speech systems
TLDR
This paper proposes a high-quality generative text-to-speech (TTS) system using an effective spectrum and excitation estimation method, and verifies its merit for producing expressive speech segments by adopting a global-style-token-based emotion embedding.
Speaker-independent raw waveform model for glottal excitation
TLDR
A multi-speaker 'GlotNet' vocoder is proposed that utilizes a WaveNet to generate glottal excitation waveforms, which are then used to excite the corresponding vocal tract filter and produce speech.
A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis
TLDR
This study presents a raw waveform glottal excitation model, called GlotNet, and compares its performance with the corresponding direct speech waveform model, WaveNet, using equivalent architectures.

References

Showing 1-10 of 31 references
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it can be efficiently trained on data with tens of thousands of samples per second of audio and can also be employed as a discriminative model, returning promising results for phoneme recognition.
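The property that lets WaveNet handle tens of thousands of samples per second is its stack of dilated causal convolutions, whose receptive field grows exponentially with depth. A simplified PyTorch sketch of one causal layer follows; gated activations, residual connections, and skip connections are omitted.

import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    # A 1-D convolution that never sees future samples: the input is
    # left-padded by (kernel_size - 1) * dilation before convolving.
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# Dilations 1, 2, 4, ..., 512 give ten layers a receptive field of
# 1024 past samples; WaveNet repeats such stacks several times.
stack = nn.Sequential(*[CausalConv1d(64, dilation=2 ** i) for i in range(10)])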
Direct Modeling of Frequency Spectra and Waveform Generation Based on Phase Recovery for DNN-Based Speech Synthesis
TLDR
STFT spectral amplitudes that include harmonic information derived from F0 are predicted directly through a DNN-based acoustic model, and Griffin and Lim's approach is investigated to recover phase and generate waveforms.
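Griffin and Lim's phase recovery alternates between the time and STFT domains, keeping the predicted magnitude fixed and re-estimating only the phase. A compact NumPy/librosa sketch of the iteration is shown below; librosa.griffinlim provides a tested implementation, and the parameter values here are illustrative.

import numpy as np
import librosa

def griffin_lim(magnitude, hop=256, n_iter=60):
    # magnitude: (1 + n_fft // 2, frames) target STFT magnitudes.
    angles = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase, then re-analyze.
        y = librosa.istft(magnitude * angles, hop_length=hop)
        rebuilt = librosa.stft(y, n_fft=2 * (magnitude.shape[0] - 1),
                               hop_length=hop)
        # Keep only the new phase; the magnitude stays fixed.
        angles = np.exp(1j * np.angle(rebuilt))
    return librosa.istft(magnitude * angles, hop_length=hop)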
An autoregressive recurrent mixture density network for parametric speech synthesis
TLDR
A recurrent mixture density network that incorporates a trainable autoregressive model is proposed; the autoregressive model learns to act as a filter that emphasizes the high-frequency components of the target acoustic feature trajectories in the training stage and increases their global variance in the synthesis stage.
Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis
TLDR
A new representation for the FFT spectrum tailored to statistical parametric speech synthesis is proposed; it uses simple, computationally cheap operations and can operate at a frame rate lower than the 200 frames per second typical of many systems.
Generative adversarial network-based postfilter for statistical parametric speech synthesis
TLDR
Objective evaluations show that the GAN-based postfilter can compensate for detailed spectral structures, including the modulation spectrum, and subjective evaluations show that its generated speech is comparable to natural speech.
A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis
TLDR
Both objective measures and results from subjective listening tests show that the proposed method performs significantly better than a conventional architecture that requires the linguistic input to be at the acoustic frame rate.
Generative Adversarial Network-Based Postfilter for STFT Spectrograms
TLDR
The proposed postfilter can be used to reduce the gap between synthesized and target spectra, even in the high-dimensional STFT domain, and is applied to a DNN-based speech-synthesis task.
Speaker-Dependent WaveNet Vocoder
TLDR
A speaker-dependent WaveNet vocoder is proposed, a method of synthesizing speech waveforms with WaveNet that utilizes acoustic features from an existing vocoder as auxiliary features of WaveNet.
Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders
TLDR
A powerful new WaveNet-style autoencoder model is detailed that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and NSynth, a large-scale and high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets, is introduced.
A Log Domain Pulse Model for Parametric Speech Synthesis
TLDR
A new signal model is proposed that leads to a simple synthesizer without the need for ad-hoc tuning of model parameters; it adopts a combination of speech components that are additive in the log domain.