Corpus ID: 235421619

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

@article{Hsu2021HuBERTSS,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Wei-Ning Hsu and Benjamin Bolte and Yao-Hung Hubert Tsai and Kushal Lakhotia and R. Salakhutdinov and Abdelrahman Mohamed},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.07447}
}
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
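As a rough illustration of the recipe the abstract describes, the PyTorch sketch below clusters frame-level features offline with k-means, masks a subset of frames, and applies the cross-entropy prediction loss over the masked positions only. The module sizes, the uniform random masking, and the synthetic features are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn
    from sklearn.cluster import KMeans

    class MaskedClusterPredictor(nn.Module):
        """Toy stand-in for the HuBERT encoder: predict cluster IDs at masked frames."""
        def __init__(self, feat_dim=39, hidden=256, num_clusters=100):
            super().__init__()
            self.proj_in = nn.Linear(feat_dim, hidden)
            self.mask_emb = nn.Parameter(torch.zeros(hidden))  # learned [MASK] vector
            layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(hidden, num_clusters)

        def forward(self, feats, mask):
            # feats: (B, T, feat_dim) acoustic features; mask: (B, T) bool, True = masked
            x = self.proj_in(feats)
            x[mask] = self.mask_emb          # hide masked frames behind the mask embedding
            return self.head(self.encoder(x))

    # Offline step: k-means over frame features provides per-frame pseudo-labels.
    # Random tensors stand in for MFCC features of two utterances.
    feats = torch.randn(2, 200, 39)
    km = KMeans(n_clusters=100, n_init=4)
    labels = torch.as_tensor(
        km.fit_predict(feats.reshape(-1, 39).numpy()), dtype=torch.long
    ).view(2, 200)

    model = MaskedClusterPredictor()
    mask = torch.rand(2, 200) < 0.5          # stand-in for the paper's span masking
    logits = model(feats, mask)
    # Key ingredient: the prediction loss is computed over masked frames only.
    loss = nn.functional.cross_entropy(logits[mask], labels[mask])
    loss.backward()

In a second iteration, the k-means teacher would be re-fit on representations from this model rather than on raw features, which is what the abstract means by two iterations of clustering.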

Citations

Injecting Text in Self-Supervised Speech Pretraining
Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two…
Direct speech-to-speech translation with discrete units
  • Ann Lee, Peng-Jen Chen, +8 authors Wei-Ning Hsu
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR: This work presents a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation, and designs a multitask learning framework with joint speech and text training that enables the model to generate dual-mode output in the same inference pass.
Text-Free Prosody-Aware Generative Spoken Language Modeling
  • E. Kharitonov, Ann Lee, +8 authors Wei-Ning Hsu
  • Computer Science, Engineering
  • 2021
Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored.
Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development
This paper introduces a human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets. The pipeline combines human and machine advantages to more quickly…
A Longitudinal Normative Dataset and Protocol for Speech and Language Biomarker Research
Although speech and language biomarker (SLB) research studies have shown methodological and clinical promise, some common limitations of these studies include small sample sizes, limited longitudinal…
W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training
  • Yu-An Chung, Yu Zhang, +4 authors Yonghui Wu
  • Computer Science, Engineering
  • ArXiv
  • 2021
Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT that explores MLM for self-supervised speech representation learning.
fairseq S^2: A Scalable and Integrable Speech Synthesis Toolkit
  • Changhan Wang, Wei-Ning Hsu, +5 authors J. Pino
  • Engineering, Computer Science
  • 2021
This paper presents fairseq S^2, a fairseq extension for speech synthesis. We implement a number of autoregressive (AR) and non-AR text-to-speech models, and their multi-speaker variants. To enable…
SUPERB: Speech processing Universal PERformance Benchmark
TLDR: A simple framework solves SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model, favoring re-usability; results demonstrate that the framework is promising, as SSL representations show competitive generalizability and accessibility across SUPERB tasks.
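To make that probing recipe concrete, here is a hedged sketch of a SUPERB-style setup: a pretrained encoder is frozen and only a lightweight task head is trained on top. The toy encoder, mean-pooling, and layer sizes are assumptions for illustration, not the benchmark's exact code.

    import torch
    import torch.nn as nn

    class FrozenEncoderProbe(nn.Module):
        def __init__(self, encoder, hidden=256, num_classes=10):
            super().__init__()
            self.encoder = encoder
            for p in self.encoder.parameters():
                p.requires_grad = False                 # shared model stays frozen and reusable
            self.head = nn.Linear(hidden, num_classes)  # task-specialized lightweight head

        def forward(self, x):
            with torch.no_grad():
                reps = self.encoder(x)              # (B, T, hidden) frozen features
            return self.head(reps.mean(dim=1))      # mean-pool over time, then classify

    # Stand-in for a pretrained SSL encoder; any frozen feature extractor fits here.
    toy_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
    probe = FrozenEncoderProbe(toy_encoder)
    feats = torch.randn(4, 120, 80)                  # e.g. log-mel frames
    labels = torch.randint(0, 10, (4,))
    loss = nn.functional.cross_entropy(probe(feats), labels)
    loss.backward()                                  # gradients reach only the head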

References

Showing 1-10 of 65 references.
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
TLDR: Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
Unsupervised Speech Representation Learning Using WaveNet Autoencoders
TLDR: A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance, with performance comparable to the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
Unsupervised Pre-Training of Bidirectional Speech Encoders via Masked Reconstruction
TLDR: The main factors found to drive speech recognition improvements are masking segments of sufficient width in both time and frequency, pre-training on a much larger amount of unlabeled data than labeled data, and domain adaptation when the unlabeled and labeled data come from different domains.
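A minimal sketch of the masking scheme this entry highlights: zero out contiguous blocks of a filterbank spectrogram in both time and frequency, then reconstruct the hidden cells. The block widths and the tiny reconstruction model are illustrative choices, not the paper's configuration.

    import torch
    import torch.nn as nn

    def mask_time_freq(spec, time_width=20, freq_width=10):
        """spec: (B, T, F). Returns the masked copy plus the boolean mask."""
        B, T, F = spec.shape
        mask = torch.zeros(B, T, F, dtype=torch.bool)
        for b in range(B):
            t0 = torch.randint(0, T - time_width, (1,)).item()
            f0 = torch.randint(0, F - freq_width, (1,)).item()
            mask[b, t0:t0 + time_width, :] = True   # contiguous time block
            mask[b, :, f0:f0 + freq_width] = True   # contiguous frequency block
        return spec.masked_fill(mask, 0.0), mask

    spec = torch.randn(2, 100, 40)                  # batch of filterbank features
    masked, mask = mask_time_freq(spec)
    model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 40))
    recon = model(masked)
    loss = (recon - spec)[mask].abs().mean()        # L1 loss on masked cells only
    loss.backward()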
Deep Contextualized Acoustic Representations for Semi-Supervised Speech Recognition
TLDR: This work first exploits a large amount of unlabeled audio data via representation learning, reconstructing a temporal slice of filterbank features from past and future context frames, and then trains a CTC-based end-to-end ASR system using a smaller amount of labeled audio data.
Generative Pre-Training for Speech with Autoregressive Predictive Coding
  • Yu-An Chung, James R. Glass
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR: This paper proposes to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations.
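Since the APC objective is easy to state precisely, here is a hedged sketch: a unidirectional RNN reads past frames and predicts the frame n steps ahead under an L1 loss. The GRU size and n=3 are stand-ins for the paper's choices.

    import torch
    import torch.nn as nn

    class APC(nn.Module):
        def __init__(self, feat_dim=80, hidden=512):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)  # left-to-right only
            self.out = nn.Linear(hidden, feat_dim)

        def forward(self, x):
            h, _ = self.rnn(x)        # h[:, t] summarizes frames 0..t
            return self.out(h)

    n = 3                             # predict n frames into the future
    x = torch.randn(4, 200, 80)       # surrogate log-mel utterances
    model = APC()
    pred = model(x)[:, :-n]           # prediction emitted at each time step
    target = x[:, n:]                 # the frame n steps ahead of it
    loss = (pred - target).abs().mean()
    loss.backward()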
Semi-Supervised Speech Recognition via Local Prior Matching
TLDR: This work proposes local prior matching (LPM), a semi-supervised objective that distills knowledge from a strong prior (e.g. a language model) to provide a learning signal to a discriminative model trained on unlabeled speech.
Self-Training for End-to-End Speech Recognition
  • Jacob Kahn, Ann Lee, Awni Y. Hannun
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR: This work revisits self-training in the context of end-to-end speech recognition and demonstrates that training with pseudo-labels can substantially improve the accuracy of a baseline model.
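Schematically, the self-training loop reduces to a few lines. In this sketch the train and transcribe callables are hypothetical placeholders for supervised training and for decoding with a confidence score; the confidence filter and round count are illustrative, not the paper's recipe.

    def self_train(labeled, unlabeled, train, transcribe, rounds=2, threshold=0.9):
        model = train(labeled)                      # seed model on the labeled set
        for _ in range(rounds):
            pseudo = []
            for audio in unlabeled:
                text, confidence = transcribe(model, audio)
                if confidence >= threshold:         # keep only confident hypotheses
                    pseudo.append((audio, text))
            model = train(labeled + pseudo)         # retrain on real + pseudo-labels
        return model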
wav2vec: Unsupervised Pre-training for Speech Recognition
TLDR: wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.
A Nonparametric Bayesian Approach to Acoustic Model Discovery
TLDR: An unsupervised model is presented that simultaneously segments the speech, discovers a proper set of sub-word units, and learns a Hidden Markov Model for each induced acoustic unit, outperforming a language-mismatched acoustic model.
DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization
TLDR: This work proposes DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization, which uses Transformers in the encoding module instead of LSTMs and combines the reconstructive loss with a vector-quantization diversity loss to train speech representations.
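A rough sketch of the vector-quantization ingredient named above: snap each frame to its nearest codebook entry with a straight-through gradient, and combine a reconstructive loss with a diversity term that spreads usage across codes. The codebook size, the VQ-VAE-style commitment terms, and the loss weights are assumptions, not DeCoAR 2.0's exact formulation.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=64, dim=256):
            super().__init__()
            self.codebook = nn.Parameter(torch.randn(num_codes, dim))

        def forward(self, h):                        # h: (B, T, dim) encoder output
            # Squared distance from every frame to every code: (B, T, num_codes).
            dist = (h.unsqueeze(-2) - self.codebook).pow(2).sum(-1)
            idx = dist.argmin(-1)
            q = self.codebook[idx]                   # nearest code per frame
            # VQ-VAE-style terms: pull codes toward frames and frames toward codes.
            vq_loss = (q - h.detach()).pow(2).mean() \
                + 0.25 * (q.detach() - h).pow(2).mean()
            q = h + (q - h).detach()                 # straight-through estimator
            # Diversity term: negative entropy of average soft code usage.
            probs = torch.softmax(-dist, dim=-1).mean(dim=(0, 1))
            diversity = (probs * probs.clamp_min(1e-9).log()).sum()
            return q, vq_loss + 0.1 * diversity

    vq = VectorQuantizer()
    decoder = nn.Linear(256, 80)                     # rebuilds filterbank frames
    h = torch.randn(2, 50, 256, requires_grad=True)  # stand-in encoder output
    target = torch.randn(2, 50, 80)                  # frames to reconstruct
    q, aux_loss = vq(h)
    loss = (decoder(q) - target).abs().mean() + aux_loss
    loss.backward()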