Corpus ID: 236087801

Translatotron 2: Robust direct speech-to-speech translation

Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the other three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the…

Direct simultaneous speech to speech translation
  • Xutai Ma, Hongyu Gong, +7 authors J. Pino
  • Computer Science, Engineering
  • ArXiv
  • 2021
We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech
Incremental Speech Synthesis For Speech-To-Speech Translation
  • Danni Liu, Changhan Wang, Hongyu Gong, Xutai Ma, Yun Tang, J. Pino
  • Computer Science, Engineering
  • ArXiv
  • 2021
This work focuses on improving the incremental synthesis performance of TTS models, proposes latency metrics tailored to S2ST applications, and investigates methods for latency reduction in this context.
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal, demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody.
Speaker Generation
TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers.


Direct speech-to-speech translation with discrete units
This work presents a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. It also designs a multitask learning framework with joint speech and text training that enables the model to generate dual-mode output simultaneously in the same inference pass.
Transformer-Based Direct Speech-To-Speech Translation with Transcoder
A step-by-step scheme to a complete end-to-end speech-to-speech translation and a Transformer-based speech translation using Transcoder are proposed, and a multi-task model using syntactically similar and distant language pairs is compared.
SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation
  • Arya D. McCarthy, Liezl Puzon, J. Pino
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
It is shown that the autoencoding speaker conversion approach can be combined with augmentation by machine-translated transcripts to obtain a competitive end-to-end AST model that outperforms a very strong cascade model on an English–French AST task.
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
A recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another, illustrating the power of attention-based models.
Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus
The Fisher and Callhome Spanish-English Speech Translation Corpus is introduced, supplementing existing LDC audio and transcripts with ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and English translations obtained on Amazon’s Mechanical Turk.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
A cross-lingual, multi-speaker neural end-to-end TTS framework is presented that can model speaker characteristics, synthesize speech in different languages, and achieve decent naturalness and similarity for both languages.
The ATR Multilingual Speech-to-Speech Translation System
The ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages, uses a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations.