Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning

Devang S. Ram Mohan, Raphael Lenain, Lorenzo Foglianti, Tian Huey Teh, Marlene Staib, Alexandra Torresquintero, Jiameng Gao
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised. This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation. Interleaving the action of reading a character with that of synthesising audio reduces this latency. However, the order of this sequence of interleaved actions varies across sentences, which raises the question of how the actions should be chosen. We propose a… 

Incremental Text-to-Speech Synthesis Using Pseudo Lookahead With Large Pretrained Language Model

This letter presents an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency, and achieves speech quality equivalent to that obtained by waiting for the future context to be observed.

Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input

This paper investigates whether predicted future text from a transformer language model can attenuate the loss in prosody that incremental synthesis incurs in a neural TTS system. Measuring prosodic features, it finds that predicted text improves on a zero-word lookahead but yields only slight gains over a random-word lookahead.

A Survey on Neural Speech Synthesis

A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, with a focus on the key components of neural TTS, including text analysis, acoustic models, and vocoders.

Incremental Speech Synthesis For Speech-To-Speech Translation

This work focuses on improving the incremental synthesis performance of TTS models, proposes latency metrics tailored to S2ST applications, and investigates methods for latency reduction in this context.

Efficient Incremental Text-to-Speech on GPUs

This work reveals the effectiveness of high-performance incremental TTS on GPUs with Instant Request Pooling and Module-wise Dynamic Batching and demonstrates that it outperforms the non-incremental twin in both concurrency and latency.

Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook

An overview of recent advancements in reinforcement learning and bandits is presented, along with a discussion of how they can be effectively employed to solve speech and natural language processing problems with models that are adaptive, interactive and scalable.

From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation

This work minimizes the initial waiting time of iTTS by adapting the upstream speech translator to generate high-quality pseudo lookahead for the speech synthesizer. It formalizes this as a latency metric and presents a simple yet effective duration-scaling approach for latency reduction.

Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network

This paper proposes an incremental TTS method that directly predicts the unobserved future context with a lightweight model, instead of sampling words from the large-scale language model, and performs knowledge distillation from a GPT2-based context prediction network into a simple recurrent model by minimizing a teacher-student loss defined between the context embedding vectors of those models.



Incremental Text-to-Speech Synthesis with Prefix-to-Prefix Framework

This work proposes a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation, which achieves speech naturalness similar to full-sentence TTS with only a constant latency of 1-2 words.

Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS

The experimental results show that the proposed stepwise monotonic attention method could achieve significant improvements in robustness on out-of-domain scenarios for phoneme-based models, without any regression on the in-domain naturalness test.

Monotonic Chunkwise Attention

Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed, is proposed. Models utilizing MoChA are shown to train efficiently with standard backpropagation while allowing online and linear-time decoding at test time.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps

HMM training strategy for incremental speech synthesis

This study describes a voice training procedure for HMM-based speech synthesis that explicitly integrates potential uncertainty about some contextual features, and shows that the proposed strategy outperforms the baseline technique for French.

Online and Linear-Time Attention by Enforcing Monotonic Alignments

This work proposes an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time and validates the approach on sentence summarization, machine translation, and online speech recognition problems.

Local Monotonic Attention Mechanism for End-to-End Speech And Language Processing

Experimental results on ASR, G2P and machine translation between two languages with similar sentence structures demonstrate that the proposed encoder-decoder model with local monotonic attention could achieve significant performance improvements and reduce the computational complexity in comparison with the one that used the standard global attention architecture.

MelNet: A Generative Model for Audio in the Frequency Domain

This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.

End-to-end attention-based large vocabulary speech recognition

This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

A Neural Transducer

This work presents a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation, and that performs well for long sequences even when attention mechanisms are not used.