Unsupervised Cross-lingual Representation Learning for Speech Recognition

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdel-rahman Mohamed, Michael Auli
This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. It builds on a concurrently introduced self-supervised model that is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly…
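The contrastive task over masked latents can be sketched as follows. This is an illustrative InfoNCE-style loss under assumed shapes and names, not the paper's actual implementation: at a masked timestep, the context network's output must identify the true quantized latent among a set of distractors.

```python
import numpy as np

def contrastive_loss(context, true_latent, distractors, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative sketch).

    context: context-network output at a masked timestep, shape (d,)
    true_latent: quantized latent for that timestep, shape (d,)
    distractors: negatives from other masked timesteps, shape (k, d)
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Candidate 0 is the true latent; the rest are distractors.
    candidates = np.vstack([true_latent[None, :], distractors])
    sims = np.array([cos(context, c) for c in candidates]) / temperature
    # Loss is the negative log-probability of the true latent under a softmax.
    logits = sims - sims.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
d, k = 16, 5
z = rng.normal(size=d)            # true quantized latent
c = z + 0.1 * rng.normal(size=d)  # context vector close to the target
negs = rng.normal(size=(k, d))    # sampled distractors
loss = contrastive_loss(c, z, negs)
```

In the actual model the distractors are sampled from other masked timesteps of the same utterance, and the quantization codebook is shared across languages, which is what encourages cross-lingual transfer.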


Towards Semi-Supervised Semantics Understanding from Speech
Experiments show that the proposed SLU framework with speech as input can perform on par with oracle-text input for semantics understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.
Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition
The A2 framework overcomes the long-tail problem via three techniques: exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; proposing dual adapters consisting of both language-specific and language-agnostic adaptation with minimal additional parameters; and addressing class imbalance during training.
Any-to-One Sequence-to-Sequence Voice Conversion Using Self-Supervised Discrete Speech Representations
This work utilizes vq-wav2vec (VQW2V), a discretized self-supervised speech representation learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic content, in a sequence-to-sequence (seq2seq) framework.
Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios
This work trains a model on the Wilderness dataset and investigates how its latent space compares with classical language family findings, providing a new direction for cross-lingual data augmentation in any speech-based NLP task.
Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning
This work proposes a supervised contrastive learning (SCL) objective for the fine-tuning stage of natural language understanding classification models and demonstrates that the new objective leads to models that are more robust to different levels of noise in the training data, and can generalize better to related tasks with limited labeled task data.
The CogALex Shared Task on Monolingual and Multilingual Identification of Semantic Relations
Top performance was achieved by a transformer-based model in both the monolingual and the multilingual setting, for all the tested languages, demonstrating the potential of this recently introduced neural architecture.
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
A meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions reported in the literature, is presented, and adaptation algorithms are characterized as based on embeddings, model parameter adaptation, or data augmentation.
Automatic Speech Recognition and Query By Example for Creole Languages Documentation
It is proposed to use about one hour of annotated data to design an automatic speech recognition system for each language, and the amount of data needed to obtain a query-by-example system usable by linguists is evaluated.
Fine-tuning pre-trained models for Automatic Speech Recognition, experiments on a fieldwork corpus of Japhug (Trans-Himalayan family)
A deep learning approach based on language-specific tuning of a generic pre-trained representation model, XLS-R, using a Transformer architecture reaches the stage of automatic word recognition in Japhug.
ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022, low-resource and dialect speech translation, and highlights that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models.


Unsupervised Pretraining Transfers Well Across Languages
It is shown that a slight modification of CPC pretraining extracts features that transfer well to other languages, being on par with or even outperforming supervised pretraining, which shows the potential of unsupervised methods for languages with few linguistic resources.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Automatic Speech Recognition for Under-Resourced Languages: Application to Vietnamese Language
V. Le, L. Besacier. IEEE Transactions on Audio, Speech, and Language Processing, 2009.
Experimental results on Vietnamese showed that with only a few hours of target-language speech data, cross-lingual context-independent modeling worked better than cross-lingual context-dependent modeling, but was outperformed by the latter when more speech data became available; in both cases, cross-lingual systems were better than monolingual baseline systems.
Multilingual Speech Recognition with a Single End-to-End Model
This model, which is not explicitly given any information about language identity, improves recognition performance by 21% relative compared to analogous sequence-to-sequence models trained on each language individually, and improves performance by an additional 7% relative while eliminating confusion between different languages.
Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model
This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real world data: the imbalance in training data across languages.
An Unsupervised Autoregressive Model for Speech Representation Learning
Speech representations learned by the proposed unsupervised autoregressive neural model significantly improve performance on both phone classification and speaker verification over surface features and other supervised and unsupervised approaches.
Multilingual acoustic modeling for speech recognition based on subspace Gaussian Mixture Models
This work reports experiments on a different approach to multilingual speech recognition, in which the phone sets are entirely distinct but the model has parameters not tied to specific states that are shared across languages.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.
Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling
Experimental results show that the transfer learning approach from the multilingual model shows substantial gains over monolingual models across all 4 BABEL languages.