Corpus ID: 233301488

TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

@article{Beliaev2021TalkNet2N,
  title={TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction},
  author={Stanislav Beliaev and Boris Ginsburg},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.08189}
}
We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction. The model consists of three feed-forward convolutional networks. The first network predicts grapheme durations; the input text is then expanded by repeating each symbol according to its predicted duration. The second network predicts a pitch value for every mel frame. The third network generates a mel-spectrogram from the expanded text, conditioned on the predicted pitch… 
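
A minimal PyTorch sketch of the three-stage pipeline described in the abstract. The module structure, sizes, and names below are illustrative assumptions, not the paper's exact architecture; the paper builds each stage from depth-wise separable convolutional blocks.

import torch
import torch.nn as nn


class ConvStage(nn.Module):
    """Stand-in for one feed-forward convolutional network (assumed structure)."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 256, kernel: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, out_dim, 1),
        )

    def forward(self, x):  # x: (batch, in_dim, time)
        return self.net(x)


def expand_by_duration(embeddings, durations):
    """Repeat each symbol embedding durations[t] times along the time axis."""
    # embeddings: (time, dim); durations: (time,) integer frame counts
    return torch.repeat_interleave(embeddings, durations, dim=0)


# Illustrative forward pass for one utterance.
emb_dim, n_mels = 64, 80
grapheme_emb = torch.randn(12, emb_dim)            # 12 input symbols

duration_net = ConvStage(emb_dim, 1)               # stage 1: duration per symbol
pitch_net = ConvStage(emb_dim, 1)                  # stage 2: pitch per mel frame
mel_net = ConvStage(emb_dim + 1, n_mels)           # stage 3: mel-spectrogram

x = grapheme_emb.t().unsqueeze(0)                  # (1, dim, n_symbols)
durations = duration_net(x).squeeze().exp().round().long().clamp(min=1)
expanded = expand_by_duration(grapheme_emb, durations)  # (n_frames, dim)

frames = expanded.t().unsqueeze(0)                 # (1, dim, n_frames)
pitch = pitch_net(frames)                          # (1, 1, n_frames)
mel = mel_net(torch.cat([frames, pitch], dim=1))   # condition on predicted pitch
print(mel.shape)                                   # (1, 80, n_frames)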

Citations

FlexLip: A Controllable Text-to-Lip System

TLDR
This paper tackles a sub-problem of text-to-video generation by converting text into lip landmarks with a modular, controllable system architecture, and introduces a series of objective evaluation measures; the results are comparable with those obtained when using a larger set of training samples.

A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond

TLDR
This survey provides a systematic review, with comparisons and discussions, of various non-autoregressive translation (NAT) models from different aspects, and categorizes NAT efforts into several groups, including data manipulation, modeling methods, training criteria, decoding algorithms, and benefits from pre-trained models.

Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

TLDR
Nix-TTS is a lightweight neural TTS (text-to-speech) model obtained by applying knowledge distillation to a powerful yet large generative TTS teacher model; it achieves over 3.26× and 8.36× inference speedup on an Intel i7 CPU and a Raspberry Pi, respectively.
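
A minimal sketch of output-level knowledge distillation for TTS: a small student is trained to match the mel-spectrogram of a frozen teacher. Nix-TTS actually distills an end-to-end model module-by-module, so this simplified output-matching form and all names are assumptions.

import torch
import torch.nn.functional as F

def distill_loss(student_mel, teacher_mel):
    # L1 distance between student and (frozen) teacher predictions
    return F.l1_loss(student_mel, teacher_mel)

teacher_mel = torch.randn(2, 80, 200)                     # frozen teacher output
student_mel = torch.randn(2, 80, 200, requires_grad=True)  # stand-in for student output
loss = distill_loss(student_mel, teacher_mel)
loss.backward()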

Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings

TLDR
This paper describes Mixer-TTS, a non-autoregressive model for mel-spectrogram generation based on the MLP-Mixer architecture adapted for speech synthesis, which achieves much faster speech synthesis than models of similar quality.
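
For reference, a sketch of a basic MLP-Mixer block of the kind Mixer-TTS adapts: one MLP mixes information across time steps (token mixing) and one across features (channel mixing). The hyperparameters are illustrative, and the real model modifies the block to handle variable-length sequences.

import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    def __init__(self, seq_len: int, dim: int, token_hidden: int = 256, channel_hidden: int = 512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(      # mixes information across time steps
            nn.Linear(seq_len, token_hidden), nn.GELU(), nn.Linear(token_hidden, seq_len)
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(    # mixes information across features
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim)
        )

    def forward(self, x):                    # x: (batch, seq_len, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x


block = MixerBlock(seq_len=100, dim=80)
print(block(torch.randn(2, 100, 80)).shape)  # (2, 100, 80)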

A Survey on Neural Speech Synthesis

TLDR
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, with a focus on the key components of neural TTS, including text analysis, acoustic models, and vocoders.

References

Showing 1-10 of 31 references

FastSpeech: Fast, Robust and Controllable Text to Speech

TLDR
A novel feed-forward network based on the Transformer is proposed to generate mel-spectrograms in parallel for TTS; called FastSpeech, it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

TLDR
Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers.

Deep Voice: Real-time Neural Text-to-Speech

TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.

WaveGlow: A Flow-based Generative Network for Speech Synthesis

TLDR
WaveGlow is a flow-based network capable of generating high quality speech from mel-spectrograms, implemented using only a single network, trained using a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.
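
A generic maximum-likelihood objective for a normalizing flow, i.e. the single cost function described above. In WaveGlow, z and the log-determinant would come from the invertible transform of audio conditioned on the mel-spectrogram; the variable names and sigma value here are illustrative assumptions.

import torch

def flow_nll(z: torch.Tensor, log_det: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # -log p(x) = -log N(z; 0, sigma^2 I) - log|det dz/dx|, averaged per element
    gaussian_term = z.pow(2).sum() / (2 * sigma ** 2)
    return (gaussian_term - log_det.sum()) / z.numel()

z = torch.randn(4, 8, 1000)   # latent from the forward (audio -> z) pass
log_det = torch.zeros(4)      # accumulated log-determinant per example
print(flow_nll(z, log_det))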

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

TLDR
It is demonstrated that modeling the periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown on mel-spectrogram inversion for unseen speakers and on end-to-end speech synthesis.
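
A sketch of how a period discriminator can expose periodic patterns: the 1-D waveform is padded and reshaped into 2-D so that samples one period apart line up in the same column, then processed with 2-D convolutions. The period and layer settings here are illustrative, not HiFi-GAN's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F


def reshape_by_period(wav: torch.Tensor, period: int) -> torch.Tensor:
    # wav: (batch, 1, T) -> (batch, 1, T // period, period)
    b, c, t = wav.shape
    if t % period:  # right-pad so the length divides the period
        pad = period - t % period
        wav = F.pad(wav, (0, pad), mode="reflect")
        t = t + pad
    return wav.view(b, c, t // period, period)


x = torch.randn(2, 1, 22050)
x2d = reshape_by_period(x, period=5)
disc = nn.Conv2d(1, 32, kernel_size=(5, 1), stride=(3, 1), padding=(2, 0))
print(disc(x2d).shape)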

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

TLDR
Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor, which improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model.

FastPitch: Parallel Text-to-speech with Pitch Prediction

TLDR
It is found that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles the voluntary modulation of voice, making it comparable to state-of-the-art speech synthesis.
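
A sketch of the pitch-control idea: shift the predicted pitch contour by a constant amount before it conditions the decoder. The semitone parameterization and function name are illustrative assumptions, not FastPitch's exact interface.

import torch

def shift_pitch(pitch_hz: torch.Tensor, semitones: float) -> torch.Tensor:
    # A uniform shift of `semitones` scales frequency by 2^(semitones / 12);
    # unvoiced positions (pitch == 0) are left untouched.
    factor = 2.0 ** (semitones / 12.0)
    return torch.where(pitch_hz > 0, pitch_hz * factor, pitch_hz)

pitch = torch.tensor([0.0, 180.0, 195.0, 0.0, 210.0])  # Hz, 0 = unvoiced
print(shift_pitch(pitch, semitones=2.0))               # uniformly raised contour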

FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

TLDR
FastSpeech 2 is proposed, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by training the model directly with the ground-truth target instead of the simplified output from a teacher, and by introducing more variation information of speech as conditional inputs.

QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

TLDR
A new end-to-end neural acoustic model for automatic speech recognition is proposed that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal while having fewer parameters than all competing models.
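
Since TalkNet 2's title refers to depth-wise separable convolutions, here is a sketch of the 1-D time-channel separable convolution used in QuartzNet: a depth-wise convolution that operates on each channel independently across time, followed by a point-wise 1x1 convolution that mixes channels. Channel counts and kernel size are illustrative.

import torch
import torch.nn as nn


class TimeChannelSeparableConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int):
        super().__init__()
        self.depthwise = nn.Conv1d(  # time-wise: one filter per channel
            in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch
        )
        self.pointwise = nn.Conv1d(in_ch, out_ch, 1)  # channel-wise mixing

    def forward(self, x):            # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))


conv = TimeChannelSeparableConv1d(64, 128, kernel=33)
print(conv(torch.randn(1, 64, 400)).shape)  # (1, 128, 400)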

NeMo: a toolkit for building AI applications using Neural Modules

TLDR
NeMo (Neural Modules) is a framework-agnostic Python toolkit for creating AI applications through re-usability, abstraction, and composition, with built-in support for distributed training and mixed precision on the latest NVIDIA GPUs.