Corpus ID: 239049755

SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

  title={SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training},
  author={Ankur Bapna and Yu-An Chung and Nan Wu and Anmol Gulati and Ye Jia and J. Clark and Melvin Johnson and Jason Riesa and Alexis Conneau and Yu Zhang},
Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-training within a single model. We build a single encoder with the BERT objective on unlabeled text… 
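The BERT objective mentioned in the abstract is masked prediction: hide a fraction of the input tokens and train the encoder to reconstruct them from context. A minimal sketch of the masking step, assuming a hypothetical `mask_tokens` helper and the conventional ~15% masking rate (illustrative, not the paper's exact recipe):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_id="[MASK]", seed=0):
    """BERT-style masking: replace roughly mask_prob of the positions with a
    mask symbol and record the original values as prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the encoder must reconstruct this token
            masked.append(mask_id)
        else:
            masked.append(tok)
    return masked, targets

tokens = "speech and text share one encoder".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=0)
```

The same masking idea applies to both modalities: for text the units are subword tokens, for speech they are quantized latent frames, which is what lets a single encoder train on both.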

Citations

mSLAM: Massively multilingual joint pre-training for speech and text
mSLAM is evaluated on several downstream speech understanding tasks; joint pre-training with text improves quality on speech translation, speech intent classification and speech language ID, while remaining competitive with speech-only pre-training on multilingual ASR.
Unified Speech-Text Pre-training for Speech Translation and Recognition
Experiments show the proposed method effectively fuses speech and text information into one model, achieving a 1.7 to 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset and WERs comparable to wav2vec 2.0 on the LibriSpeech speech recognition task.
A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing
This work proposes Alignment-Aware Acoustic-Text Pretraining (A3T), a framework that reconstructs masked acoustic signals from text input and acoustic-text alignment during training; it generates high-quality reconstructed spectrograms and can be applied directly to speech editing and unseen-speaker TTS.
Self-Supervised Speech Representation Learning: A Review
This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.
MAESTRO: Matched Speech Text Representations through Modality Matching
MAESTRO is a novel algorithm that simultaneously learns unified representations from both speech and text modalities, which transfer to diverse downstream tasks such as Automatic Speech Recognition (ASR) and Speech Translation (ST).
A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer
While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled…
Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation
This work explores multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2.
Non-Parametric Domain Adaptation for End-to-End Speech Translation
A novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system is proposed; when only in-domain text translation data is involved, this approach improves the baseline by 12.82 BLEU on average.
The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task
Experimental results show that the YiTrans system obtains a significant improvement over the strong baseline on three translation directions, achieving +5.2 BLEU over last year's optimal end-to-end system on tst2021 English-German.

References

w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation…
Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation
A Fused Acoustic and Text Masked Language Model (FATMLM) is proposed which jointly learns a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and even pure speech and text data.
Injecting Text in Self-Supervised Speech Pretraining
The proposed method, tts4pretrain complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text.
Large-Scale Self- and Semi-Supervised Learning for Speech Translation
This paper explores both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with Common-Crawl, effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being…
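wav2vec 2.0's pre-training loss is contrastive: the context vector at a masked position must identify the true quantized latent among sampled distractors. A minimal NumPy sketch of such an InfoNCE-style loss, assuming a hypothetical `info_nce` function; the temperature value and cosine-similarity scoring are illustrative, not the paper's exact formulation:

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss in the spirit of wav2vec 2.0: the context vector at a
    masked position (anchor) should score the true target (positive) higher
    than the distractors (negatives)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of the anchor to the positive (index 0) and each negative.
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits /= temperature
    # Numerically stable log-softmax; loss is the negative log-probability
    # assigned to the positive.
    m = logits.max()
    log_probs = logits - np.log(np.exp(logits - m).sum()) - m
    return -log_probs[0]

anchor = np.array([1.0, 0.0])
loss = info_nce(anchor, np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
```

When the anchor matches the positive, the loss is near zero; when a distractor matches instead, the loss is large, which is what drives the encoder to make masked positions predictive of their true latents.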
Unsupervised Cross-lingual Representation Learning for Speech Recognition
XLSR is presented, which learns cross-lingual speech representations by pretraining a single model on the raw waveform of speech in multiple languages, enabling a single multilingual speech recognition model that is competitive with strong individual models.
Unsupervised Speech Recognition
Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
SpeechStew is a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal, and it is demonstrated that SpeechStew learns powerful transfer learning representations.
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
It is shown that the combination of pretraining, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data, while obtaining SoTA performance on many public benchmarks.
wav2vec: Unsupervised Pre-training for Speech Recognition
Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.