Direct speech-to-speech translation with a sequence-to-sequence model
@inproceedings{Jia2019DirectST,
  title={Direct speech-to-speech translation with a sequence-to-sequence model},
  author={Ye Jia and Ron J. Weiss and Fadi Biadsy and Wolfgang Macherey and Melvin Johnson and Z. Chen and Yonghui Wu},
  booktitle={INTERSPEECH},
  year={2019}
}
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this…
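The abstract describes an attention-based sequence-to-sequence model that maps source spectrogram frames directly to target spectrogram frames. As a rough illustration only (this is not the paper's actual architecture; all shapes, weights, and names below are hypothetical), a single decoder step with dot-product attention can be sketched in NumPy:

```python
import numpy as np

# Illustrative sketch, NOT the paper's model: one decoder step of an
# attention-based sequence-to-sequence network mapping source spectrogram
# frames to a target spectrogram frame. Weights are random placeholders.

rng = np.random.default_rng(0)

T_src, d_mel, d_hid = 50, 80, 64   # source frames, mel bins, hidden size

# "Encoder": project each source frame into a hidden representation.
W_enc = rng.standard_normal((d_mel, d_hid)) * 0.1
source_mels = rng.standard_normal((T_src, d_mel))   # fake source spectrogram
enc_states = np.tanh(source_mels @ W_enc)           # (T_src, d_hid)

def attend(query, keys):
    """Dot-product attention: softmax-weight encoder states by similarity."""
    scores = keys @ query                           # (T_src,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over source frames
    return weights @ keys, weights                  # context: (d_hid,)

# One decoder step: the decoder state attends over encoder states, then the
# concatenated [state; context] is projected to the next target mel frame.
W_out = rng.standard_normal((2 * d_hid, d_mel)) * 0.1
dec_state = np.zeros(d_hid)
context, weights = attend(dec_state, enc_states)
next_frame = np.concatenate([dec_state, context]) @ W_out   # (d_mel,)

print(next_frame.shape)
```

In a trained model this step would be repeated autoregressively, with each predicted frame feeding the next decoder state; the point here is only that translation proceeds spectrogram-to-spectrogram, with no text anywhere in the loop.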
98 Citations
Direct Speech-to-Speech Translation With Discrete Units
- Computer Science, ACL
- 2022
A direct speech-to-speech translation model that translates speech from one language to speech in another without relying on intermediate text generation is presented; it is comparable to models that predict spectrograms and are trained with text supervision.
Direct simultaneous speech to speech translation
- Computer Science, ArXiv
- 2021
We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech…
Transformer-Based Direct Speech-To-Speech Translation with Transcoder
- Computer Science, 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
A step-by-step scheme toward complete end-to-end speech-to-speech translation and a Transformer-based speech translation model using a Transcoder are proposed, and multi-task models using syntactically similar and distant language pairs are compared.
Speech-to-Speech Translation Between Untranscribed Unknown Languages
- Computer Science, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
This is the first work to perform pure speech-to-speech translation between untranscribed unknown languages; the model directly generates target speech without any auxiliary or pre-training steps that use a source or target transcription.
A Direct Speech-to-Speech Neural Network Methodology for Spanish-English Translation
- Computer Science, EAI Endorsed Trans. Energy Web
- 2020
A novel direct speech-to-speech translation methodology based on an LSTM neural network, following the recently emerged idea of direct translation without a text representation, on the grounds that such training better corresponds to how oral language learning takes place in humans.
Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention
- Computer Science
- 2021
The direct simultaneous model is shown to outperform the cascaded model by achieving a better tradeoff between translation quality and latency, and variational monotonic multihead attention (V-MMA) is introduced to handle the challenge of inefficient policy learning in simultaneous speech translation.
Incremental Speech Synthesis For Speech-To-Speech Translation
- Computer Science, ArXiv
- 2021
This work focuses on improving the incremental synthesis performance of TTS models, proposes latency metrics tailored to S2ST applications, and investigates methods for latency reduction in this context.
Speech-to-Speech Translation without Text
- Computer Science
- 2020
This is the first work to perform pure speech-to-speech translation between untranscribed unknown languages; the model directly generates target speech without any auxiliary or pre-training steps that use source or target transcriptions.
Textless Speech-to-Speech Translation on Real Data
- Computer Science, ArXiv
- 2021
This work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs, enabled by a self-supervised unit-based speech normalization technique.
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
- Computer Science, ArXiv
- 2022
Self-supervised pre-training with unlabeled speech data and data augmentation consistently improve the performance of direct speech-to-speech translation models compared with multitask learning, with a BLEU gain of 4.3-12.0.
References
Showing 1-10 of 49 references
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
- Computer Science, INTERSPEECH
- 2017
A recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another, illustrating the power of attention-based models.
Prosody Generation for Speech-to-Speech Translation
- Linguistics, Computer Science, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings
- 2006
This work proposes the use of prosodic features in the original speech to produce prosody in the target language, using an unsupervised clustering algorithm that finds, in a bilingual speech corpus, intonation clusters in the source speech which are relevant in the target speech.
Finite-state speech-to-speech translation
- Linguistics, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing
- 1997
A fully integrated approach to speech input language translation in limited domain applications is presented and results for a task in the framework of hotel front desk communication, with a vocabulary of about 700 words, are reported.
End-to-End Automatic Speech Translation of Audiobooks
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
Experimental results show that it is possible to train compact and efficient end-to-end speech translation models in this setup, and the authors hope that the speech translation baseline on this corpus will be challenged in the future.
The ATR Multilingual Speech-to-Speech Translation System
- Computer Science, IEEE Transactions on Audio, Speech, and Language Processing
- 2006
The ATR multilingual speech-to-speech translation (S2ST) system, which is mainly focused on translation between English and Asian languages, uses a parallel multilingual database consisting of over 600 000 sentences that cover a broad range of travel-related conversations.
Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
It is demonstrated that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance.
Personalising Speech-To-Speech Translation in the EMIME Project
- Computer Science, ACL
- 2010
An HMM statistical framework for both speech recognition and synthesis is employed, which provides transformation mechanisms to adapt the synthesized voice in TTS (text-to-speech) using the recognized voice in ASR (automatic speech recognition).
An end-to-end model for cross-lingual transformation of paralinguistic information
- Computer Science, Linguistics, Machine Translation
- 2018
The long-term goal is a system that allows users to speak a foreign language with the same expressiveness as if they were speaking their own language; to this end, a method is proposed that translates features from the input speech to the output speech in continuous space, reconstructing the input acoustic features in the target language.
End-to-End Spoken Language Translation
- Computer Science, ArXiv
- 2019
A method for translating spoken sentences from one language into spoken sentences in another language using a pyramidal-bidirectional recurrent network combined with a convolutional network to output sentence-level spectrograms in the target language.
Improved speech-to-text translation with the Fisher and Callhome Spanish-English speech translation corpus
- Computer Science, IWSLT
- 2013
The Fisher and Callhome Spanish-English Speech Translation Corpus is introduced, supplementing existing LDC audio and transcripts with ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and English translations obtained on Amazon’s Mechanical Turk.