wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
- Alexei Baevski, Henry Zhou, Abdel-rahman Mohamed, Michael Auli
- Computer Science, Neural Information Processing Systems
- 20 June 2020
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being…
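As a concrete reference point, here is a minimal sketch of running one of the released wav2vec 2.0 checkpoints for speech recognition, assuming the Hugging Face `transformers` port of the model (`facebook/wav2vec2-base-960h`); the silent placeholder waveform merely stands in for real 16 kHz audio.

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# wav2vec 2.0 checkpoint fine-tuned for CTC-based speech recognition
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform = [0.0] * 16000  # placeholder for one second of real 16 kHz audio
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))  # greedy CTC transcription
```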
fairseq: A Fast, Extensible Toolkit for Sequence Modeling
- Myle Ott, Sergey Edunov, Michael Auli
- Computer Science, North American Chapter of the Association for Computational Linguistics
- 1 April 2019
Fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks, and it supports distributed training across multiple GPUs and machines.
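For orientation, a hedged usage sketch: fairseq exposes many of its pre-trained models through `torch.hub`, roughly as below. The model name and the `tokenizer`/`bpe` arguments follow the fairseq README for the WMT'19 translation models and may differ between releases.

```python
import torch

# Download and load a pre-trained fairseq translation model via torch.hub
en2de = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de.single_model",
    tokenizer="moses",
    bpe="fastbpe",
)
en2de.eval()

# The returned hub interface wraps tokenization, BPE, and beam search
print(en2de.translate("Machine learning is great!"))
```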
wav2vec: Unsupervised Pre-training for Speech Recognition
- Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli
- Computer Science, Interspeech
- 11 April 2019
Wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
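To make the pre-training objective more concrete, below is a toy, self-contained sketch of the contrastive idea behind wav2vec: each context vector must score the true future latent above randomly drawn distractors from the same utterance. The function name, shapes, and the softmax form of the loss are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_future_loss(context, latents, k=1, num_negatives=10):
    """Toy InfoNCE-style objective in the spirit of wav2vec.

    context, latents: (T, D) tensors for a single utterance; the context
    vector at step t is asked to identify the true latent z_{t+k} among
    randomly sampled distractor latents.
    """
    T, D = latents.shape
    c = context[: T - k]                              # predictions at each step t
    pos = latents[k:]                                 # true future latents z_{t+k}
    neg_idx = torch.randint(0, T, (T - k, num_negatives))
    neg = latents[neg_idx]                            # (T-k, N, D) distractors

    pos_logit = (c * pos).sum(-1, keepdim=True)       # (T-k, 1)
    neg_logit = torch.einsum("td,tnd->tn", c, neg)    # (T-k, N)
    logits = torch.cat([pos_logit, neg_logit], dim=-1)
    targets = torch.zeros(T - k, dtype=torch.long)    # the positive is index 0
    return F.cross_entropy(logits, targets)
```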
Pay Less Attention with Lightweight and Dynamic Convolutions
- Felix Wu, Angela Fan, Alexei Baevski, Y. Dauphin, Michael Auli
- Computer Science, International Conference on Learning Representations
- 29 January 2019
A very lightweight convolution is shown to perform competitively with the best reported self-attention results, and dynamic convolutions are introduced that are simpler and more efficient than self-attention.
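A rough sketch of the lightweight convolution building block may help: a depthwise 1D convolution whose kernel weights are softmax-normalized over the kernel width and shared across groups of channels ("heads"). This is a simplified reading of the paper (dynamic convolutions additionally predict the kernel from the current timestep), and the class below is illustrative, not the fairseq implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv1d(nn.Module):
    """Depthwise conv with softmax-normalized, head-shared kernel weights."""

    def __init__(self, dim, kernel_size=3, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.heads, self.k = num_heads, kernel_size
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))

    def forward(self, x):                       # x: (batch, dim, time)
        _, channels, _ = x.shape
        w = F.softmax(self.weight, dim=-1)      # normalize each head's kernel
        w = w.repeat(channels // self.heads, 1, 1)  # share kernels across channels
        return F.conv1d(x, w, padding=self.k // 2, groups=channels)

# Example: 2 utterances, 16 channels, 50 timesteps
out = LightweightConv1d(dim=16)(torch.randn(2, 16, 50))
print(out.shape)  # torch.Size([2, 16, 50])
```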
Unsupervised Cross-lingual Representation Learning for Speech Recognition
- Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdel-rahman Mohamed, Michael Auli
- Computer Science, Interspeech
- 24 June 2020
XLSR is presented, which learns cross-lingual speech representations by pretraining a single model on the raw waveform of speech in multiple languages, enabling a single multilingual speech recognition model that is competitive with strong individual models.
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
- Alexei Baevski, Steffen Schneider, Michael Auli
- Computer Science, International Conference on Learning Representations
- 12 October 2019
The algorithm uses a Gumbel softmax or online k-means clustering to quantize the dense representations, and experiments show that BERT pre-training on the resulting discrete units achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
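As a sketch of the quantization step described here, the module below discretizes dense features with a straight-through Gumbel-softmax over a single small codebook. The real model uses grouped codebooks and offers a k-means alternative, so treat the names and sizes below as illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelVectorQuantizer(nn.Module):
    """Map dense features to discrete codes via straight-through Gumbel-softmax."""

    def __init__(self, dim, num_codes=320, temp=1.0):
        super().__init__()
        self.to_logits = nn.Linear(dim, num_codes)
        self.codebook = nn.Embedding(num_codes, dim)
        self.temp = temp

    def forward(self, z):                        # z: (batch, time, dim)
        logits = self.to_logits(z)
        # hard=True picks one-hot codes in the forward pass while gradients
        # flow through the soft sample (straight-through estimator)
        one_hot = F.gumbel_softmax(logits, tau=self.temp, hard=True)
        quantized = one_hot @ self.codebook.weight   # replace features by code vectors
        codes = one_hot.argmax(dim=-1)               # discrete unit ids for BERT training
        return quantized, codes

quantized, codes = GumbelVectorQuantizer(dim=64)(torch.randn(2, 20, 64))
print(quantized.shape, codes.shape)  # (2, 20, 64) (2, 20)
```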
Facebook FAIR’s WMT19 News Translation Task Submission
- Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov
- Computer Science, Conference on Machine Translation
- 15 July 2019
This paper describes Facebook FAIR’s submission to the WMT19 shared news translation task, which achieves the best case-sensitive BLEU score for the English→Russian translation direction.
Adaptive Input Representations for Neural Language Modeling
- Alexei Baevski, Michael Auli
- Computer Science, International Conference on Learning Representations
- 27 September 2018
Adaptive input representations for neural language modeling, which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity, are introduced, and a systematic comparison of popular choices for a self-attentional architecture is performed.
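A minimal sketch of the variable-capacity idea, under the assumption that token ids are sorted by frequency: frequent words get full-sized embeddings, while rarer clusters get smaller embeddings that are projected up to the model dimension. The cutoffs and shrink factor below are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    """Frequency-clustered input embeddings with per-cluster capacity."""

    def __init__(self, vocab_size, model_dim, cutoffs=(2000, 10000), factor=4):
        super().__init__()
        self.model_dim = model_dim
        self.cutoffs = [0, *cutoffs, vocab_size]
        self.embeds, self.projs = nn.ModuleList(), nn.ModuleList()
        for i in range(len(self.cutoffs) - 1):
            size = self.cutoffs[i + 1] - self.cutoffs[i]
            dim = model_dim // (factor ** i)          # shrink capacity per cluster
            self.embeds.append(nn.Embedding(size, dim))
            self.projs.append(nn.Linear(dim, model_dim, bias=False))

    def forward(self, tokens):                        # tokens: (batch, seq) of ids
        out = torch.zeros(*tokens.shape, self.model_dim, device=tokens.device)
        for i in range(len(self.embeds)):
            lo, hi = self.cutoffs[i], self.cutoffs[i + 1]
            mask = (tokens >= lo) & (tokens < hi)
            if mask.any():
                # embed within the cluster, then project to the model dimension
                out[mask] = self.projs[i](self.embeds[i](tokens[mask] - lo))
        return out

emb = AdaptiveInput(vocab_size=30000, model_dim=512)
print(emb(torch.randint(0, 30000, (2, 8))).shape)  # torch.Size([2, 8, 512])
```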
Unsupervised Speech Recognition
- Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli
- Computer Science, Neural Information Processing Systems
- 24 May 2021
Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3, rivaling some of the best published systems trained on 960 hours of labeled data from only two years earlier.
Cloze-driven Pretraining of Self-attention Networks
- Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, Michael Auli
- Computer Science, Conference on Empirical Methods in Natural Language Processing
- 19 March 2019
A new approach is presented for pretraining a bi-directional transformer model on a cloze-style word reconstruction task, providing significant performance gains across a variety of language understanding problems, together with a detailed analysis of a number of factors that contribute to effective pretraining.
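For concreteness, a tiny sketch of a cloze-style masking step in the spirit of this objective. The mask probability, the special mask token id, and the ignore-index convention are assumptions for illustration, not details from the paper.

```python
import torch

def cloze_mask(tokens, mask_id, mask_prob=0.15):
    """Replace a random subset of tokens with a mask symbol for reconstruction.

    Returns (inputs, labels): the model sees `inputs` and is trained to
    predict the original words only at the masked positions; all other
    label positions are set to -100 so cross-entropy ignores them.
    """
    mask = torch.rand(tokens.shape) < mask_prob
    inputs = tokens.clone()
    inputs[mask] = mask_id
    labels = tokens.clone()
    labels[~mask] = -100
    return inputs, labels

inputs, labels = cloze_mask(torch.randint(5, 1000, (2, 12)), mask_id=3)
print(inputs.shape, (labels != -100).sum().item(), "positions to reconstruct")
```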
...