Corpus ID: 239009547

ESPnet2-TTS: Extending the Edge of TTS Research

Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions such as full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art…


Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis
This paper describes the major framework of Muskits, its functionalities, and experimental results in single-singer, multi-singer, multilingual, and transfer learning scenarios, and demonstrates several advanced usages based on the toolkit's functionalities.
JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech
The proposed model outperforms publicly available, state-of-the-art implementations of ESPnet2-TTS on subjective evaluation (MOS) and some objective evaluations, and removes the dependency on an external speech-text alignment tool by adopting an alignment learning objective in the joint training framework.
DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning
This work proposes a degradation-robust TTS method, which can be trained on speech corpora that contain both additive noises and environmental distortions, and proposes a regularization method to attain a clean environmental embedding that is disentangled from utterance-dependent information.
Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble
Grapheme-to-Phoneme (G2P) conversion has many applications in NLP and speech fields. Most existing work focuses heavily on languages with abundant training datasets, which limits the scope of target languages.
Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features
Embedding vectors derived from articulatory features rather than from phoneme identities are used to learn phoneme representations that hold across languages, enabling fine-tuning of a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.


ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit
The experimental results show that the ESPnet-TTS models can achieve state-of-the-art performance comparable to the other latest toolkits, resulting in a mean opinion score (MOS) of 4.25 on the LJSpeech dataset.
Recent Developments on ESPnet Toolkit Boosted by Conformer
This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS).
Close to Human Quality TTS with Transformer
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism in Tacotron 2, and achieves state-of-the-art performance and close-to-human quality.
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech
This work presents a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models, adopting variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling.
ESPnet: End-to-End Speech Processing Toolkit
A major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks are explained.
Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining
Experimental results show that a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora, can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.
fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
This paper presents fairseq S^2, a fairseq extension for speech synthesis that implements a number of autoregressive and non-autoregressive text-to-speech models, their multi-speaker variants, and a suite of automatic metrics.
Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNNs), without the use of any recurrent units, to alleviate the economic costs of training.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on the Transformer is proposed to generate mel-spectrograms in parallel for TTS; called FastSpeech, it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path toward highly scalable yet robust speech synthesis.