Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

  • Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng
  • Published 12 September 2021
  • Computer Science, Engineering
  • ArXiv
Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem of this framework is that VC is a challenging problem which usually needs a moderate amount of parallel…

VoCo: text-based insertion and replacement in audio narration
This paper presents a system that can synthesize a new word or short phrase so that it blends seamlessly into the context of the existing narration. It uses a text-to-speech synthesizer to say the word in a generic voice, then uses voice conversion to convert it into a voice that matches the narration.
Context-Aware Prosody Correction for Text-Based Speech Editing
This work proposes a new context-aware method for more natural-sounding text-based editing of speech. It uses a series of neural networks to generate salient prosody features that depend on the prosody of the speech surrounding the edit and are amenable to fine-grained user control.
FastSpeech: Fast, Robust and Controllable Text to Speech
FastSpeech, a novel feed-forward network based on Transformer that generates mel-spectrograms in parallel for TTS, is proposed; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
Text-based editing of talking-head video
This work proposes a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e., no jump cuts).
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Cute: A concatenative method for voice conversion using exemplar-based unit selection
This work proposes a method that circumvents voice conversion concerns by using concatenative synthesis coupled with exemplar-based unit selection, and introduces triphone-based preselection that greatly reduces computation and enforces selection of long, contiguous pieces.
Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi
The Montreal Forced Aligner (MFA) is an update to the Prosodylab-Aligner that maintains its key functionality of trainability on new data while incorporating an improved architecture (triphone acoustic models and speaker adaptation) and other features.
Neural Speech Synthesis with Transformer Network
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron2, achieving state-of-the-art performance and close to human quality.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers.