Corpus ID: 236087801

Translatotron 2: Robust direct speech-to-speech translation

Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz
We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and a single attention module that connects the three preceding components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in translation quality and predicted speech naturalness, and drastically improves the robustness of the… 
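The component layout described in the abstract can be illustrated with a toy pipeline. The sketch below is not the authors' implementation: all class names, dimensions, and the random "weights" are assumptions, and the real model learns its parameters end-to-end from data. It only shows the data flow the abstract names, i.e. one shared attention module feeding both the phoneme decoder and the mel-spectrogram synthesizer:

```python
import numpy as np

rng = np.random.default_rng(0)

class SpeechEncoder:
    """Maps source mel-spectrogram frames to hidden states (stub: random projection)."""
    def __init__(self, n_mels=80, d_model=16):
        self.w = rng.standard_normal((n_mels, d_model)) * 0.1
    def __call__(self, mel):            # (T, n_mels) -> (T, d_model)
        return mel @ self.w

class Attention:
    """Single attention module shared by the phoneme decoder and the synthesizer."""
    def __call__(self, query, keys):    # (d,), (T, d) -> context vector (d,)
        scores = keys @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ keys

class PhonemeDecoder:
    """Stub decoder: predicts one target phoneme id per step from the context."""
    def __init__(self, d_model=16, n_phonemes=40):
        self.w = rng.standard_normal((d_model, n_phonemes)) * 0.1
    def step(self, context):
        return int(np.argmax(context @ self.w))

class MelSynthesizer:
    """Stub synthesizer: emits one target mel frame per step from the context."""
    def __init__(self, d_model=16, n_mels=80):
        self.w = rng.standard_normal((d_model, n_mels)) * 0.1
    def step(self, context):
        return context @ self.w

def translate(mel, steps=5):
    enc, att = SpeechEncoder(), Attention()
    pdec, synth = PhonemeDecoder(), MelSynthesizer()
    states = enc(mel)
    query = states.mean(axis=0)            # simplistic initial query
    phonemes, frames = [], []
    for _ in range(steps):
        ctx = att(query, states)           # one shared attention context...
        phonemes.append(pdec.step(ctx))    # ...drives the phoneme decoder
        frames.append(synth.step(ctx))     # ...and the spectrogram synthesizer
        query = ctx                        # feed context back as the next query
    return phonemes, np.stack(frames)

phonemes, frames = translate(rng.standard_normal((30, 80)))
```

The point of the sketch is the wiring, not the math: both decoding heads consume the same attention context, which is one way to read the abstract's claim that a single attention module "connects the three preceding components".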


Textless Speech-to-Speech Translation on Real Data
To the authors' knowledge, this work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs; it finetunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce variation due to accents.
CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
To build strong cascade S2ST baselines, an ST model is trained on CoVoST 2, outperforming the previous state of the art trained on the corpus without extra data by 5.8 BLEU.
Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention
  • Xutai Ma, Hongyu Gong, +6 authors J. Pino
  • Computer Science, Engineering
  • 2021
The direct simultaneous model is shown to outperform the cascaded model by achieving a better tradeoff between translation quality and latency; variational monotonic multihead attention (V-MMA) is proposed to handle the challenge of inefficient policy learning in simultaneous speech translation.
Direct simultaneous speech to speech translation
We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech.
Incremental Speech Synthesis For Speech-To-Speech Translation
This work focuses on improving the incremental synthesis performance of TTS models, proposes latency metrics tailored to S2ST applications, and investigates methods for latency reduction in this context.
More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal, demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody.
Speaker Generation
TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers.


Direct speech-to-speech translation with discrete units
This work presents a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another without relying on intermediate text generation, and designs a multitask learning framework with joint speech and text training that enables the model to generate dual-mode output in a single inference pass.
Transformer-Based Direct Speech-To-Speech Translation with Transcoder
A step-by-step scheme toward complete end-to-end speech-to-speech translation and a Transformer-based speech translation model using a Transcoder are proposed, and multi-task models using syntactically similar and distant language pairs are compared.
SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation
  • Arya D. McCarthy, Liezl Puzon, J. Pino
  • Computer Science, Engineering
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
It is shown that the auto-encoding speaker conversion approach can be combined with augmentation by machine-translated transcripts to obtain a competitive end-to-end AST model that outperforms a very strong cascade model on an English–French AST task.
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
A recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another, illustrating the power of attention-based models.
Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus
The Fisher and Callhome Spanish-English Speech Translation Corpus is introduced, supplementing existing LDC audio and transcripts with ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and English translations obtained on Amazon’s Mechanical Turk.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on the Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS, speeding up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Cross-Lingual, Multi-Speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
A cross-lingual, multi-speaker neural end-to-end TTS framework which can model speaker characteristics and synthesize speech in different languages and acquire decent naturalness and similarity for both languages is presented.
The ATR Multilingual Speech-to-Speech Translation System
The ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages, uses a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations.