Corpus ID: 233301488

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

@article{Beliaev2021TalkNet2N,
  title={TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction},
  author={Stanislav Beliaev and Boris Ginsburg},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.08189}
}
We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction. The model consists of three feed-forward convolutional networks. The first network predicts grapheme durations; the input text is then expanded by repeating each symbol according to its predicted duration. The second network predicts a pitch value for every mel frame. The third network generates a mel-spectrogram from the expanded text, conditioned on the predicted pitch…
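
To make the three-network pipeline in the abstract concrete, here is a minimal PyTorch sketch. All names, layer sizes, and the plain Conv1d stacks are illustrative assumptions; the actual model uses QuartzNet-style depth-wise separable convolutional blocks (see the QuartzNet reference below).

# Minimal sketch of the TalkNet pipeline from the abstract:
# duration -> expand -> pitch -> mel. Sizes and the log-duration
# parameterization are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

VOCAB, HID, N_MELS = 100, 256, 80

def conv_stack(c_in, c_out, layers=3, k=5):
    mods = []
    for i in range(layers):
        mods += [nn.Conv1d(c_in if i == 0 else c_out, c_out, k, padding=k // 2),
                 nn.ReLU()]
    return nn.Sequential(*mods)

embed = nn.Embedding(VOCAB, HID)
duration_net = nn.Sequential(conv_stack(HID, HID), nn.Conv1d(HID, 1, 1))
pitch_net = nn.Sequential(conv_stack(HID, HID), nn.Conv1d(HID, 1, 1))
mel_net = nn.Sequential(conv_stack(HID + 1, HID), nn.Conv1d(HID, N_MELS, 1))

tokens = torch.randint(0, VOCAB, (1, 12))           # grapheme ids
h = embed(tokens).transpose(1, 2)                   # (B, HID, T_text)

# network 1: an integer duration per grapheme (log-domain prediction)
dur = duration_net(h).squeeze(1).exp().round().long().clamp(min=1)

# expansion: repeat each grapheme's features `dur` times
h_exp = torch.repeat_interleave(h, dur[0], dim=2)   # (B, HID, T_mel)

# network 2: a pitch value for every mel frame
pitch = pitch_net(h_exp)                            # (B, 1, T_mel)

# network 3: mel-spectrogram from expanded text conditioned on pitch
mel = mel_net(torch.cat([h_exp, pitch], dim=1))     # (B, N_MELS, T_mel)
print(mel.shape)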

Citations

FlexLip: A Controllable Text-to-Lip System

This paper tackles a sub-problem of text-to-video generation by converting text into lip landmarks with a modular, controllable system architecture, and introduces a series of objective evaluation measures; results are comparable with those obtained using a larger set of training samples.

A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond

This survey systematically compares and discusses various non-autoregressive translation (NAT) models from different aspects, categorizing NAT efforts into several groups: data manipulation, modeling methods, training criteria, decoding algorithms, and benefits from pre-trained models.

Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

Nix-TTS is a lightweight neural TTS (text-to-speech) model obtained by applying knowledge distillation to a powerful yet large-sized generative TTS teacher model; it achieves over 3.26× and 8.36× inference speedup on an Intel i7 CPU and a Raspberry Pi, respectively.

Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

Nix-TTS, a lightweight TTS model obtained via module-wise knowledge distillation from a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model, is presented.
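
Both Nix-TTS entries describe the same core idea: train a small student to reproduce a large frozen teacher. A generic sketch under stated assumptions (the toy modules and plain MSE objective are placeholders; Nix-TTS distills module-wise with its own losses):

# Generic output-distillation sketch: the small student is trained to
# match a frozen, pre-trained teacher. Models and loss are placeholders.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(16, 512), nn.ReLU(), nn.Linear(512, 80))
student = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 80))
teacher.requires_grad_(False)   # teacher weights stay fixed

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x = torch.randn(8, 16)          # stand-in text encodings
loss = nn.functional.mse_loss(student(x), teacher(x))
loss.backward()
opt.step()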

Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings

This paper describes Mixer-TTS, a non-autoregressive model for mel-spectrogram generation based on the MLP-Mixer architecture adapted for speech synthesis, which achieves much faster speech synthesis than models of similar quality.
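
For reference, the building block Mixer-TTS adapts is the MLP-Mixer block: a token-mixing MLP across the time axis plus a channel-mixing MLP, each with a residual connection. The sketch below is the vanilla fixed-length form with illustrative sizes; Mixer-TTS modifies it to handle variable-length speech.

import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, seq_len, dim, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(seq_len, hidden), nn.GELU(),
                                       nn.Linear(hidden, seq_len))
        self.norm2 = nn.LayerNorm(dim)
        self.chan_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                      nn.Linear(hidden, dim))

    def forward(self, x):                 # x: (B, T, C)
        # mix across time (token mixing), then across channels
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

x = torch.randn(2, 100, 192)
print(MixerBlock(seq_len=100, dim=192)(x).shape)   # (2, 100, 192)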

A Survey on Neural Speech Synthesis

A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, with a focus on the key components of neural TTS: text analysis, acoustic models, and vocoders.

References

FastSpeech: Fast, Robust and Controllable Text to Speech

FastSpeech, a novel feed-forward network based on the Transformer that generates mel-spectrograms in parallel for TTS, is proposed; it speeds up mel-spectrogram generation by 270× and end-to-end speech synthesis by 38×.

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieve mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers.

Deep Voice: Real-time Neural Text-to-Speech

Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can run faster than real time, and describes optimized WaveNet inference kernels for both CPU and GPU that achieve up to 400× speedups over existing implementations.

Waveglow: A Flow-based Generative Network for Speech Synthesis

WaveGlow is a flow-based network capable of generating high-quality speech from mel-spectrograms. It is implemented as a single network trained with a single cost function, maximizing the likelihood of the training data, which makes the training procedure simple and stable.
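
The "single cost function" here is the exact negative log-likelihood that invertible flows make tractable. A minimal sketch with one affine coupling layer (WaveGlow's invertible 1×1 convolutions, WaveNet-style coupling networks, and mel conditioning are omitted; sizes are illustrative):

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """z_b = x_b * exp(log_s(x_a)) + t(x_a); invertible given x_a."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.Tanh(),
                                 nn.Linear(64, dim))   # outputs log_s and t

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        zb = xb * log_s.exp() + t
        return torch.cat([xa, zb], dim=1), log_s.sum(dim=1)  # z, log|det J|

flow = AffineCoupling(dim=8)
x = torch.randn(4, 8)                       # stand-in audio samples
z, logdet = flow(x)
# NLL under a standard-normal prior, minus the change-of-variables term
nll = (0.5 * z.pow(2).sum(dim=1) - logdet).mean()
nll.backward()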

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown on mel-spectrogram inversion for unseen speakers and on end-to-end speech synthesis.
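
The periodic-pattern modeling referred to is HiFi-GAN's multi-period discriminator, which folds the 1D waveform into a 2D grid of width p so that 2D convolutions compare samples exactly one period apart. A sketch with illustrative conv sizes (the period set {2, 3, 5, 7, 11} matches the paper):

import torch
import torch.nn as nn

def to_period_grid(wav, period):
    """wav: (B, 1, T) -> (B, 1, T//period, period), right-padded to fit."""
    b, c, t = wav.shape
    pad = (period - t % period) % period
    wav = nn.functional.pad(wav, (0, pad), mode="reflect")
    return wav.view(b, c, -1, period)

# toy period discriminator: kernels span time only, never across periods
disc = nn.Sequential(nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
                     nn.LeakyReLU(0.1),
                     nn.Conv2d(32, 1, (3, 1), padding=(1, 0)))

wav = torch.randn(2, 1, 8192)
for p in (2, 3, 5, 7, 11):
    print(p, disc(to_period_grid(wav, p)).shape)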

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor; this improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in the paper for large-scale robustness evaluation using a pre-trained speech recognition model.
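
The explicit duration predictor feeds an upsampler rather than an attention module; in Non-Attentive Tacotron this is duration-based Gaussian upsampling. A sketch with a fixed spread (the paper predicts a per-token range parameter, so the constant sigma below is an assumption):

import torch

def gaussian_upsample(h, dur, sigma=1.0):
    """h: (B, N, C) token features; dur: (B, N) frame counts per token."""
    centers = torch.cumsum(dur, dim=1) - 0.5 * dur      # token centers
    t_mel = int(dur.sum(dim=1).max())
    t = torch.arange(t_mel).view(1, -1, 1) + 0.5        # frame positions
    logits = -((t - centers.unsqueeze(1)) ** 2) / (2 * sigma ** 2)
    w = torch.softmax(logits, dim=2)                    # (B, T, N) weights
    return w @ h                                        # (B, T, C)

h = torch.randn(1, 6, 192)
dur = torch.tensor([[2., 3., 1., 4., 2., 3.]])
print(gaussian_upsample(h, dur).shape)                  # (1, 15, 192)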

FastPitch: Parallel Text-to-speech with Pitch Prediction

It is found that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles voluntary modulation of the voice, while remaining comparable in quality to state-of-the-art systems.
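
The pitch control described is a simple inference-time knob: shift or scale the predicted contour before it is embedded and added back to the hidden states. A sketch (the conv embedding mirrors FastPitch's approach, but the sizes and the +50 Hz shift are illustrative):

import torch
import torch.nn as nn

h = torch.randn(1, 200, 384)                 # hidden token states (B, T, C)
pitch = 120.0 + 20.0 * torch.randn(1, 200)   # predicted F0 contour, Hz
pitch = pitch + 50.0                         # uniform shift at inference

pitch_emb = nn.Conv1d(1, 384, kernel_size=3, padding=1)
h = h + pitch_emb(pitch.unsqueeze(1)).transpose(1, 2)  # condition on pitch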

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by training the model directly with ground-truth targets instead of the simplified outputs of a teacher model, and by introducing more variation information of speech (such as pitch, energy, and duration) as conditional inputs.
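
The key training change can be reduced to a conditional: feed ground-truth variance information (pitch, energy, duration) while training, and fall back to the model's own predictions at inference. A schematic sketch with placeholder names:

import torch

def variance_adaptor(h, predictor, target=None):
    pred = predictor(h)                  # always computed, for its own loss
    used = target if target is not None else pred
    return h + used.unsqueeze(-1), pred  # condition hidden states on variance

predictor = lambda h: h.mean(dim=-1)     # stand-in variance predictor
h = torch.randn(2, 50, 256)
h_train, _ = variance_adaptor(h, predictor, target=torch.rand(2, 50))  # train
h_infer, _ = variance_adaptor(h, predictor)                            # infer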

Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

A new end-to-end neural acoustic model for automatic speech recognition is proposed that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal while having fewer parameters than all competing models.
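
The 1D time-channel separable convolution named in the title factors a standard convolution into a depthwise conv (each channel convolved over time independently) followed by a pointwise 1×1 conv (mixing channels), which is where the parameter savings come from. A sketch with an illustrative kernel size:

import torch
import torch.nn as nn

def separable_conv1d(channels, kernel=33):
    return nn.Sequential(
        nn.Conv1d(channels, channels, kernel, padding=kernel // 2,
                  groups=channels),        # depthwise: time mixing only
        nn.Conv1d(channels, channels, 1),  # pointwise: channel mixing only
    )

x = torch.randn(1, 256, 400)                # (batch, channels, time)
print(separable_conv1d(256)(x).shape)       # (1, 256, 400)

# parameter count vs. a standard conv of the same shape
std, sep = nn.Conv1d(256, 256, 33, padding=16), separable_conv1d(256)
print(sum(p.numel() for p in std.parameters()),   # ~2.1M
      sum(p.numel() for p in sep.parameters()))   # ~75K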

NeMo: a toolkit for building AI applications using Neural Modules

NeMo (Neural Modules) is a framework-agnostic Python toolkit for creating AI applications through re-usability, abstraction, and composition; it provides built-in support for distributed training and mixed precision on the latest NVIDIA GPUs.
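
A hedged usage sketch of the pattern NeMo's TTS collection exposes (the model and method names follow NeMo 1.x tutorials and are assumptions here, not taken from this page):

# Load a pre-trained spectrogram generator and vocoder, then synthesize.
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_gen = FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("tts_hifigan")

tokens = spec_gen.parse("Hello world")
spec = spec_gen.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spec)
sf.write("hello.wav", audio.detach().cpu().numpy()[0], samplerate=22050)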