Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobuyuki Morioka
End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. However, in practice, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more… 

SpanBERT: Improving Pre-training by Representing and Predicting Spans

The approach extends BERT by masking contiguous random spans, rather than random tokens, and training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it.
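The span-masking idea can be illustrated with a toy sketch (a hypothetical helper, not SpanBERT's actual implementation): instead of masking tokens independently, pick one contiguous span and mask every token inside it.

```python
import random

def mask_contiguous_span(tokens, span_len, mask_token="[MASK]", rng=None):
    """Replace one contiguous span of `span_len` tokens with mask tokens,
    in the spirit of SpanBERT's span masking (vs. per-token masking)."""
    rng = rng or random.Random(0)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    masked = list(tokens)
    for i in range(start, start + span_len):
        masked[i] = mask_token
    return masked, (start, start + span_len)

tokens = ["the", "quick", "brown", "fox", "jumps"]
masked, span = mask_contiguous_span(tokens, span_len=2)
```

The model is then trained to reconstruct the masked span, with the span boundary representations predicting its entire content.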

CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English, is introduced and the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch.

mSLAM: Massively multilingual joint pre-training for speech and text

mSLAM is evaluated on several downstream speech understanding tasks and finds that joint pre-training with text improves quality on speech translation, speech intent classification and speech language ID while being competitive on multilingual ASR, when compared against speech-only pre-training.

Conformer: Convolution-augmented Transformer for Speech Recognition

This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms the previous Transformer- and CNN-based models, achieving state-of-the-art accuracy.

Textless Speech-to-Speech Translation on Real Data

To the authors' knowledge, this work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs; it finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce variations due to accents.

Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

This work presents an approach to encode a speech signal into a fixed-size representation that minimizes the cosine loss with the existing massively multilingual LASER text embedding space, and obtains more than 1,300 hours of aligned speech in French, German, Spanish and English.
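The alignment objective above can be sketched with a toy cosine loss (a generic illustration, not the paper's training code): the loss is zero when the speech embedding points in the same direction as the target text embedding, and grows as they diverge.

```python
import math

def cosine_loss(u, v):
    """1 - cosine similarity between two vectors; minimizing this pulls
    a speech embedding u toward a fixed text embedding v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)
```

Because the loss depends only on direction, embeddings of any scale can be compared against the fixed LASER space.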

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

XLS-R is presented, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0 that improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average.

SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

It is demonstrated that incorporating both speech and text data during pre-training can significantly improve downstream quality on CoVoST 2 speech translation, by around 1 BLEU compared to single-modality pre-trained models, while retaining close to SotA performance on LibriSpeech and SpeechStew ASR tasks.

Direct simultaneous speech to speech translation

We present the first direct simultaneous speech-to-speech translation (Simul-S2ST) model, with the ability to start generating translation in the target speech before consuming the full source speech.