Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System

Bajibabu Bollepalli, Lauri Juvela, Paavo Alku
Currently, there is increasing interest in using sequence-to-sequence models with attention, such as Tacotron, in text-to-speech (TTS) synthesis. These models are end-to-end, meaning that they learn both co-articulation and duration properties directly from text and speech. Since these models are entirely data-driven, they need large amounts of data to generate synthetic speech of good quality. However, in challenging speaking styles, such as Lombard speech, it is difficult to record…
Whispered and Lombard Neural Speech Synthesis
This paper presents and compares various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data, and shows that high-quality speech can be generated through a pre-training/fine-tuning approach for all speaking styles.
Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information
A high-capability speech synthesis system is presented in which a representation of the harmonic structure of speech, called the excitation spectrogram, is designed to directly guide the learning of harmonics in the mel-spectrogram, and a conditional gated LSTM (CGLSTM) is proposed to control the flow of text-content information through the network by re-weighting the LSTM gates using speaker information.
Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis
This paper investigates the effective use of control information, such as speaker and pitch, which is differentiated from text-content information in the authors' encoder-decoder framework, and proposes a conditional gated LSTM (CGLSTM) whose input/output/forget gates are re-weighted by a speaker embedding to control the flow of text-content information in the network.
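The gate re-weighting idea described in the two CGLSTM entries above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the projection matrix `P` that maps a speaker embedding to per-unit gate weights, and all dimensions, are hypothetical choices made here for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cglstm_step(x, h, c, W, U, b, speaker_gate):
    """One step of a conditional gated LSTM (sketch): the input,
    forget, and output gate activations are re-weighted by a
    per-unit factor derived from a speaker embedding."""
    z = W @ x + U @ h + b                 # stacked gate pre-activations
    i, f, o, g = np.split(z, 4)           # input, forget, output, candidate
    i = sigmoid(i) * speaker_gate         # re-weight input gate
    f = sigmoid(f) * speaker_gate         # re-weight forget gate
    o = sigmoid(o) * speaker_gate         # re-weight output gate
    c_new = f * c + i * np.tanh(g)        # standard LSTM cell update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# toy dimensions: 8 text features, 16 hidden units, 4-dim speaker embedding
rng = np.random.default_rng(0)
dx, dh, ds = 8, 16, 4
W = rng.normal(0, 0.1, (4 * dh, dx))
U = rng.normal(0, 0.1, (4 * dh, dh))
b = np.zeros(4 * dh)

# hypothetical projection of the speaker embedding to per-unit gate weights
P = rng.normal(0, 0.1, (dh, ds))
speaker_gate = sigmoid(P @ rng.normal(size=ds))

h, c = np.zeros(dh), np.zeros(dh)
h, c = cglstm_step(rng.normal(size=dx), h, c, W, U, b, speaker_gate)
```

The key design point is that the speaker factor multiplies the gate activations rather than being added to the inputs, so it scales how much text-content information flows through each unit.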
Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
Additional subjective evaluation shows that Lombard-SSDRC TTS successfully increases speech intelligibility, with relative improvements in median keyword correction rate of 455% for speech-shaped noise (SSN) and 104% for competing-speaker noise (CSN) compared to the baseline TTS method.
Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings
  • Erica Cooper, Cheng-I Lai, +4 authors J. Yamagishi
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task and improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.
Building and Designing Expressive Speech Synthesis
A test for computer voices, the Ebert test: if a computer voice can successfully tell a joke and do the timing and delivery as well as Henny Youngman, then that's the voice I want.
Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
Attentron is proposed, a few-shot TTS model that clones the voices of speakers unseen during training and significantly outperforms state-of-the-art models when generating speech for unseen speakers in terms of speaker similarity and quality.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
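The entry above describes sampling speaker embeddings and using them to condition a multi-speaker synthesizer. A minimal sketch of that idea follows; the unit-norm sampling and the concatenate-per-timestep conditioning are common conventions assumed here for illustration, not details taken from the cited paper.

```python
import numpy as np

def sample_speaker_embedding(dim, rng):
    """Draw a random point on the unit hypersphere, mimicking how a
    novel speaker embedding might be sampled at synthesis time
    (assumption: learned embeddings are approximately unit-norm)."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def condition_encoder_outputs(enc, spk):
    """Broadcast the speaker embedding across time and concatenate it
    to every encoder state, a standard multi-speaker conditioning
    scheme for attention-based TTS decoders."""
    T = enc.shape[0]
    return np.concatenate([enc, np.tile(spk, (T, 1))], axis=1)

rng = np.random.default_rng(1)
enc = rng.normal(size=(50, 128))         # 50 text frames, 128-dim encoder states
spk = sample_speaker_embedding(64, rng)  # 64-dim speaker embedding
cond = condition_encoder_outputs(enc, spk)
```

Because the decoder only ever sees the embedding vector, any point in the embedding space, including a randomly sampled one, can in principle define a "voice", which is the property the cited result exploits.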
Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis
A semi-supervised training framework is proposed to improve the data efficiency of Tacotron and allow it to utilize textual and acoustic knowledge contained in large, publicly-available text and speech corpora.
Lombard speech synthesis using long short-term memory recurrent neural networks
Three methods for Lombard speech adaptation in LSTM-based speech synthesis are proposed and it is shown that the LSTMs can achieve significantly better adaptation performance than the HMMs in both small and large data conditions.
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Speaker-independent raw waveform model for glottal excitation
A multi-speaker 'GlotNet' vocoder is proposed, which utilizes a WaveNet to generate glottal excitation waveforms that are then used to excite the corresponding vocal tract filter to produce speech.
TTS synthesis with bidirectional LSTM based recurrent neural networks
Recurrent Neural Networks (RNNs) with Bidirectional Long Short Term Memory (BLSTM) cells are adopted to capture the correlation or co-occurrence information between any two instants in a speech utterance for parametric TTS synthesis.
Deep Voice: Real-time Neural Text-to-Speech
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
A study of speaker adaptation for DNN-based speech synthesis
An experimental analysis of speaker adaptation for DNN-based speech synthesis is presented at different levels, systematically analysing the performance of each individual adaptation technique and of their combinations.
Char2Wav: End-to-End Speech Synthesis
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.