Direct speech-to-speech translation with a sequence-to-sequence model

  • Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Z. Chen, Yonghui Wu
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this…
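The abstract describes an attention-based sequence-to-sequence model that maps source speech directly to target speech with no intermediate text. As a rough, hypothetical illustration of the idea (not the paper's actual architecture), the following toy numpy sketch autoregressively decodes a target spectrogram frame by frame, attending over encoded source frames with dot-product attention; all dimensions and weights are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
N_MELS, D_MODEL, T_SRC, T_TGT = 80, 64, 50, 40

# "Encoder": a single linear projection of source spectrogram frames.
W_enc = rng.normal(scale=0.1, size=(N_MELS, D_MODEL))

# "Decoder": a query projection of the previous output frame, plus a
# linear output layer that predicts the next target spectrogram frame.
W_q = rng.normal(scale=0.1, size=(N_MELS, D_MODEL))
W_out = rng.normal(scale=0.1, size=(D_MODEL, N_MELS))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def translate(src_mel, n_frames):
    """Map a source spectrogram (T_src, n_mels) to a target
    spectrogram (n_frames, n_mels) via dot-product attention."""
    enc = src_mel @ W_enc                          # (T_src, d_model)
    frame = np.zeros(N_MELS)                       # "go" frame
    out = []
    for _ in range(n_frames):
        q = frame @ W_q                            # (d_model,)
        attn = softmax(enc @ q / np.sqrt(D_MODEL)) # (T_src,)
        context = attn @ enc                       # (d_model,)
        frame = context @ W_out                    # next-frame prediction
        out.append(frame)
    return np.stack(out)

src = rng.normal(size=(T_SRC, N_MELS))
tgt = translate(src, T_TGT)
print(tgt.shape)  # (40, 80)
```

A real model would use recurrent or self-attention encoder/decoder stacks trained on paired spectrograms, followed by a vocoder to produce the waveform; this sketch only shows the direct spectrogram-to-spectrogram mapping with no text in the loop.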


Direct Speech-to-Speech Translation With Discrete Units
A direct speech-to-speech translation model that translates speech from one language to speech in another language without relying on intermediate text generation is presented; it performs comparably to models that predict spectrograms and are trained with text supervision.
Direct simultaneous speech to speech translation
We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, which can start generating the target speech translation before consuming the full source speech.
Transformer-Based Direct Speech-To-Speech Translation with Transcoder
A step-by-step scheme toward complete end-to-end speech-to-speech translation and a Transformer-based speech translation model using a Transcoder are proposed, and multi-task models on syntactically similar and distant language pairs are compared.
Speech-to-Speech Translation Between Untranscribed Unknown Languages
This is the first work that performed pure speech-to-speech translation between untranscribed unknown languages and can directly generate target speech without any auxiliary or pre-training steps with a source or target transcription.
A Direct Speech-to-Speech Neural Network Methodology for Spanish-English Translation
A novel direct speech-to-speech translation methodology based on an LSTM neural network, following the recent idea of direct translation without a text representation, since this kind of training better corresponds to the way humans learn oral language.
Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention
The direct simultaneous model is shown to outperform the cascaded model by achieving a better tradeoff between translation quality and latency, and variational monotonic multihead attention (V-MMA) is introduced to handle the challenge of inefficient policy learning in simultaneous speech translation.
Incremental Speech Synthesis For Speech-To-Speech Translation
This work focuses on improving the incremental synthesis performance of TTS models, proposes latency metrics tailored to S2ST applications, and investigates methods for latency reduction in this context.
Speech-to-Speech Translation without Text
This is the first work that performed pure speech-to-speech translation between untranscribed unknown languages and can directly generate target speech without any auxiliary or pre-training steps with source or target transcription.
Textless Speech-to-Speech Translation on Real Data
This work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs, enabled by a self-supervised unit-based speech normalization technique.
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Self-supervised pre-training with unlabeled speech data and data augmentation for direct speech-to-speech translation models consistently improve model performance compared with multitask learning, with a BLEU gain of 4.3-12.0.


Sequence-to-Sequence Models Can Directly Translate Foreign Speech
A recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another, illustrating the power of attention-based models.
Prosody Generation for Speech-to-Speech Translation
  • P. Agüero, J. Adell, A. Bonafonte
  • Linguistics, Computer Science
    2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
  • 2006
This work proposes the use of prosodic features in the original speech to produce prosody in the target language, using an unsupervised clustering algorithm that finds, in a bilingual speech corpus, intonation clusters in the source speech which are relevant in the target speech.
Finite-state speech-to-speech translation
  • E. Vidal
  • Linguistics
    1997 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1997
A fully integrated approach to speech input language translation in limited domain applications is presented and results for a task in the framework of hotel front desk communication, with a vocabulary of about 700 words, are reported.
End-to-End Automatic Speech Translation of Audiobooks
Experimental results show that it is possible to train compact and efficient end-to-end speech translation models in this setup, and the authors hope that the speech translation baseline on this corpus will be challenged in the future.
The ATR Multilingual Speech-to-Speech Translation System
The ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages, uses a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations.
Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation
  • Ye Jia, Melvin Johnson, Yonghui Wu
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
It is demonstrated that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance.
Personalising Speech-To-Speech Translation in the EMIME Project
An HMM statistical framework for both speech recognition and synthesis is employed which provides transformation mechanisms to adapt the synthesized voice in TTS (text-to-speech) using the recognized voice in ASR (automatic speech recognition).
An end-to-end model for cross-lingual transformation of paralinguistic information
The long-term goal is a system that allows users to speak a foreign language with the same expressiveness as if they were speaking their own language; toward this, a method is proposed that translates features from the input speech to the output speech in continuous space, reconstructing the input acoustic features in the target language.
End-to-End Spoken Language Translation
A method for translating spoken sentences from one language into spoken sentences in another language using a pyramidal-bidirectional recurrent network combined with a convolutional network to output sentence-level spectrograms in the target language.
Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus
The Fisher and Callhome Spanish-English Speech Translation Corpus is introduced, supplementing existing LDC audio and transcripts with ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and English translations obtained on Amazon’s Mechanical Turk.