GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram

@inproceedings{Juvela2019GELPGL,
  title={GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram},
  author={Lauri Juvela and Bajibabu Bollepalli and Junichi Yamagishi and Paavo Alku},
  booktitle={INTERSPEECH},
  year={2019}
}
Recent advances in neural network-based text-to-speech have reached human-level naturalness in synthetic speech. [...] Results show that the proposed model achieves a significant improvement in inference speed, while outperforming WaveNet in copy-synthesis quality.
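The source-filter idea behind GAN-excited linear prediction can be illustrated in a few lines: an excitation signal (produced by a GAN in GELP; plain white noise here as a stand-in) is shaped by an all-pole linear-prediction synthesis filter. A minimal sketch, not the paper's implementation; the LP coefficients below are made up, whereas GELP derives them from the mel-spectrogram envelope.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

# Stand-in for a GAN-generated excitation signal: 1 s of white noise at 16 kHz.
excitation = rng.standard_normal(16000)

# Illustrative LP denominator [1, a_1, ..., a_p]; in GELP these coefficients
# come from the spectral envelope, not hard-coded values.
a = np.array([1.0, -0.9])

# All-pole LP synthesis filter: s[n] = e[n] - a_1*s[n-1] - ... - a_p*s[n-p]
speech = lfilter([1.0], a, excitation)
print(speech.shape)  # (16000,)
```

Because the filter is all-pole (numerator is just 1), each output sample is the excitation sample plus a weighted sum of previous output samples, which is exactly the autoregressive structure linear prediction assumes.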
ExcitGlow: Improving a WaveGlow-based Neural Vocoder with Linear Prediction Analysis
TLDR
This paper proposes ExcitGlow, a vocoder that incorporates the source-filter model of voice production theory into a flow-based deep generative model and chooses negative log-likelihood (NLL) loss for the excitation signal and multi-resolution spectral distance for the speech signal.
SFNet: A Computationally Efficient Source Filter Model Based Neural Speech Synthesis
  • A. Mv, P. Ghosh
  • Computer Science
    IEEE Signal Processing Letters
  • 2020
TLDR
There is a significant reduction in memory and computational complexity compared to the state-of-the-art speaker-independent neural speech synthesizer, without any loss in the naturalness of the synthesized speech.
Neural Homomorphic Vocoder
TLDR
The neural homomorphic vocoder (NHV) is a source-filter-model-based neural vocoder framework that synthesizes speech by filtering impulse trains and noise with linear time-varying filters; it is highly efficient, fully controllable, and interpretable.
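NHV's harmonic-plus-noise structure can be sketched with time-invariant stand-in filters: a pitch-period impulse train excites one filter for the harmonic component, filtered noise supplies the aperiodic component, and the two are summed. In NHV the filters are time-varying and predicted by a DNN from acoustic features; the constant pitch and fixed filter coefficients below are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import lfilter

sr = 16000          # sample rate (Hz)
f0 = 100            # assumed constant pitch for this toy example
n = sr              # one second of audio

# Impulse train at the pitch period (NHV uses frame-wise F0 instead).
period = sr // f0
impulses = np.zeros(n)
impulses[::period] = 1.0

noise = np.random.default_rng(1).standard_normal(n) * 0.01

# Stand-in linear filters; NHV predicts time-varying filters with a DNN.
harmonic = lfilter([1.0], [1.0, -0.95], impulses)   # resonant all-pole filter
aperiodic = lfilter([1.0, -1.0], [1.0], noise)      # simple first-difference high-pass

speech = harmonic + aperiodic
print(speech.shape)  # (16000,)
```

The key design choice this mirrors is that both excitation sources pass through linear filters, so the synthesis stage stays interpretable even when a network supplies the filter coefficients.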
An Efficient Subband Linear Prediction for LPCNet-Based Neural Synthesis
TLDR
Both objective and subjective tests show the proposed subband LPCNet neural vocoder can synthesize higher quality speech than the original fullband one (MOS 4.62 vs. 4.54), at a rate nearly three times faster.
Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
TLDR
The proposed Parallel WaveGAN has only 1.44M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real-time in a single-GPU environment, which is comparable to the best distillation-based Parallel WaveNet system.
Transformer-PSS: A High-Efficiency Prosodic Speech Synthesis Model based on Transformer
Much attention has been given to prosodic speech synthesis with the progress of human-computer interaction and automatic content generation. However, one of its disadvantages is high computational [...]
Towards Universal Neural Vocoding with a Multi-band Excited WaveNet
TLDR
This paper introduces the Multi-Band Excited WaveNet, a neural vocoder for speaking and singing voices consisting of multiple specialized DNNs combined with dedicated signal processing components, and demonstrates remaining limits of the universality of neural vocoders, e.g. the creation of saturated singing voices.
Gaussian LPCNet for Multisample Speech Synthesis
TLDR
A modification of the LPCNet vocoder is presented that is 1.5x faster, has half as many non-zero parameters, and synthesizes speech of the same quality.
A Survey on Neural Speech Synthesis
TLDR
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, and focuses on the key components of neural TTS, including text analysis, acoustic models, and vocoders.

References

Showing 1-10 of 40 references
Waveform Generation for Text-to-speech Synthesis Using Pitch-synchronous Multi-scale Generative Adversarial Networks
TLDR
Listening test results show that while direct waveform generation with GAN is still far behind WaveNet, a GAN-based glottal excitation model can achieve quality and voice similarity on par with a WaveNet vocoder.
A Comparison of Recent Waveform Generation and Acoustic Modeling Methods for Neural-Network-Based Speech Synthesis
TLDR
This paper builds a framework in which new vocoding and acoustic modeling techniques are compared with conventional approaches by means of a large-scale crowdsourced evaluation, showing that generative adversarial networks and an autoregressive (AR) model performed better than a normal recurrent network, with the AR model performing best.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps [...]
Speech Waveform Synthesis from MFCC Sequences with Generative Adversarial Networks
  • Lauri Juvela, B. Bollepalli, +4 authors P. Alku
  • Computer Science, Engineering
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
This paper proposes a method for generating speech from filterbank mel frequency cepstral coefficients (MFCC), which are widely used in speech applications, such as ASR, but are generally considered [...]
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
LPCNet: Improving Neural Speech Synthesis through Linear Prediction
  • J. Valin, J. Skoglund
  • Computer Science, Engineering
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
It is demonstrated that LPCNet can achieve significantly higher quality than WaveRNN for the same network size, and that high-quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS, which makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.
LP-WaveNet: Linear Prediction-based WaveNet Speech Synthesis
TLDR
An LP-WaveNet vocoder, in which the complicated interactions between vocal source and vocal tract components are jointly trained within a mixture density network-based WaveNet model, which outperforms conventional WaveNet vocoders both objectively and subjectively.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Speaker-independent raw waveform model for glottal excitation
TLDR
A multi-speaker 'GlotNet' vocoder is proposed, which utilizes a WaveNet to generate glottal excitation waveforms that are then used to excite the corresponding vocal tract filter to produce speech.
GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis
TLDR
This study presents a raw waveform glottal excitation model, called GlotNet, and compares its performance with the corresponding direct speech waveform model, WaveNet, using equivalent architectures.