Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

@article{Popuri2022EnhancedDS,
  title={Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation},
  author={Sravya Popuri and Peng-Jen Chen and Changhan Wang and Juan Miguel Pino and Yossi Adi and Jiatao Gu and Wei-Ning Hsu and Ann Lee},
  journal={ArXiv},
  year={2022},
  volume={abs/2204.02967}
}
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of a recently proposed speech-to-unit… 
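
The abstract points to a speech-to-unit translation (S2UT) framework, in which target speech is represented as a sequence of discrete units learned from unlabeled audio rather than from text. As a rough sketch of how such units are typically obtained (not the paper's exact recipe), the example below quantizes self-supervised HuBERT features with k-means and collapses repeated units; the torchaudio HUBERT_BASE checkpoint, the choice of layer, the cluster count, and the file name are illustrative assumptions.

# Minimal sketch (assumptions noted above): derive discrete target-speech units
# by quantizing self-supervised speech features with k-means, as commonly done
# in speech-to-unit translation (S2UT) pipelines.
import itertools

import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE      # self-supervised speech encoder
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("target_speech.wav")   # hypothetical file name
waveform = waveform.mean(dim=0, keepdim=True)          # force mono, batch of 1
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor per transformer layer: (batch, frames, dim)
    features, _ = model.extract_features(waveform)
frames = features[5].squeeze(0).numpy()   # intermediate layer, illustrative choice

# Quantize frame-level features into discrete units. In practice the k-means
# codebook is fit on a large unlabeled corpus, not a single utterance.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)

# Collapse consecutive repeats into a "reduced" unit sequence, the usual target
# representation for a speech-to-unit translation decoder.
reduced_units = [int(u) for u, _ in itertools.groupby(units)]
print(reduced_units[:20])

In the S2UT setting, a translation model is then trained to predict such unit sequences directly from source speech, and a unit-based vocoder converts the predicted units back into a waveform.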

References

Showing 1–10 of 53 references
Direct Speech-to-Speech Translation With Discrete Units
TLDR
A direct speech-to-speech translation model is presented that translates speech in one language to speech in another without relying on intermediate text generation; it performs comparably to models that predict spectrograms and are trained with text supervision.
Textless Speech-to-Speech Translation on Real Data
TLDR
To the authors' knowledge, this work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs; it finetunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce variation due to accents.
Transformer-Based Direct Speech-To-Speech Translation with Transcoder
TLDR
A step-by-step training scheme toward complete end-to-end speech-to-speech translation is proposed, along with a Transformer-based speech translation model using a Transcoder, and multi-task models are compared on syntactically similar and distant language pairs.
Sequence-to-Sequence Models Can Directly Translate Foreign Speech
TLDR
A recurrent encoder-decoder deep neural network architecture is presented that directly translates speech in one language into text in another, illustrating the power of attention-based models.
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
TLDR
Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
Self-Training for End-to-End Speech Translation
TLDR
This work leverages pseudo-labels generated from unlabeled audio by a cascade and an end-to-end speech translation model to provide gains over a strong semi-supervised baseline on the MuST-C English-French and English-German datasets, reaching state-of-the-art performance.
Analyzing ASR Pretraining for Low-Resource Speech-to-Text Translation
TLDR
The best predictor of final AST performance is the word error rate of the pretrained ASR model, and it is found that differences in ASR/AST performance correlate with how phonetic information is encoded in the later RNN layers of the model.
Translatotron 2: Robust direct speech-to-speech translation
TLDR
Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in terms of translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation, such as babbling or long pauses.
Back-Translation-Style Data Augmentation for end-to-end ASR
TLDR
Inspired by the back-translation technique proposed in the field of machine translation, a neural text-to-encoder model is built which predicts a sequence of hidden states extracted by a pre-trained E2E-ASR encoder from a sequence of characters.
...