• Corpus ID: 239024618

Multi-Modal Pre-Training for Automated Speech Recognition

David Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister
Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach which… 

HuBERT: How Much Can a Bad Teacher Benefit ASR Pre-Training?
The Hidden-Unit BERT (HuBERT) model is proposed, which uses a cheap k-means clustering step to provide aligned target labels for pre-training a BERT model, allowing the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality.
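The clustering step described above can be illustrated with a toy example. This is only a sketch under stated assumptions, not the HuBERT implementation: the function names, the two-dimensional "frames", the naive initialization from the first k frames, and the fixed iteration count are all hypothetical choices made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(frame, centroids):
    """Index of the centroid closest to the given frame."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(frame, centroids[c]))


def kmeans_frame_labels(frames, k, iterations=10):
    """Run a few Lloyd iterations and return one cluster id per frame.

    The returned ids play the role of frame-aligned pseudo-labels.
    Initialization from the first k frames is an assumption made to keep
    the sketch deterministic.
    """
    centroids = [list(f) for f in frames[:k]]
    for _ in range(iterations):
        labels = [nearest(f, centroids) for f in frames]
        for c in range(k):
            members = [f for f, l in zip(frames, labels) if l == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(f, centroids) for f in frames]


# Toy "acoustic frames": two well-separated groups (hypothetical data).
toy_frames = [[0.0, 0.0], [10.0, 10.0], [0.1, 0.2], [9.8, 10.1], [0.2, 0.0]]
toy_labels = kmeans_frame_labels(toy_frames, k=2)
```

Each frame receives exactly one label, so the label sequence stays aligned with the input frames and could serve as per-frame prediction targets.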
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
  • Wei Han, Zhengdong Zhang, +6 authors Yonghui Wu
  • Engineering, Computer Science
  • 2020
This paper proposes a simple method for scaling the widths of ContextNet that achieves a good trade-off between computation and accuracy, and demonstrates that, on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate of 2.1%/4.6%.
Semi-Supervised End-to-End ASR via Teacher-Student Learning with Conditional Posterior Distribution
A novel teacher-student (T/S) learning method with a conditional posterior distribution is proposed for encoder-decoder-based ASR, which reduces WER by a relative 19.2% on the LibriSpeech benchmark compared with a system trained using only paired data.
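A relative WER reduction is measured against the baseline's own error rate rather than in absolute percentage points. The sketch below shows the arithmetic; the function name and the baseline and improved WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in percent: (baseline - new) / baseline * 100."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Hypothetical numbers chosen only to illustrate the formula:
# a drop from 10.0% to 8.08% WER is 1.92 points absolute.
example_reduction = relative_wer_reduction(10.0, 8.08)
```

With these made-up values the absolute drop is 1.92 points while the relative reduction is about 19.2%, which is why the two figures should not be confused when comparing systems.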
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
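Augmenting the feature inputs directly can be sketched as masking blocks of a time-by-frequency feature matrix. The example below is a minimal illustration, not the SpecAugment reference code: it applies one frequency mask and one time mask, omits time warping, and the function name, zero fill value, and width parameters are assumptions.

```python
import random


def spec_augment(features, max_freq_width, max_time_width, rng):
    """Return a copy of features[time][freq] with one band and one span zeroed.

    Assumes max_freq_width <= number of frequency bins and
    max_time_width <= number of frames. The input is not modified.
    """
    num_frames = len(features)
    num_bins = len(features[0])
    out = [row[:] for row in features]

    # Frequency mask: zero a random band of consecutive bins in every frame.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, num_bins - f_width)
    for t in range(num_frames):
        for f in range(f_start, f_start + f_width):
            out[t][f] = 0.0

    # Time mask: zero a random span of consecutive frames.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * num_bins
    return out


# Hypothetical 10-frame, 8-bin feature matrix filled with ones.
clean = [[1.0] * 8 for _ in range(10)]
augmented = spec_augment(clean, max_freq_width=3, max_time_width=4,
                         rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the masking reproducible, which is convenient when debugging a data pipeline.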
Multimodal Self-Supervised Learning of General Audio Representations
This work demonstrates that the proposed contrastive framework does not require high-resolution images to learn good audio features, and is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.
Contrastive Learning of General-Purpose Audio Representations
This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.
VideoBERT: A Joint Model for Video and Language Representation Learning
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-student Learning
This work adopts the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition performance under multimedia noise and applies a logits selection method which only preserves the k highest values to prevent wrong emphasis of knowledge from the teacher.
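Keeping only the k highest teacher logits can be sketched as follows. This is one possible reading, not the paper's method: renormalizing the kept logits with a softmax, assigning zero probability to the rest, and training the student with a soft cross-entropy are assumptions, and the function names and logit values are hypothetical.

```python
import math


def top_k_teacher_targets(teacher_logits, k):
    """Softmax over the k largest teacher logits; all other classes get 0."""
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    kept = order[:k]
    peak = max(teacher_logits[i] for i in kept)  # subtract for stability
    exps = {i: math.exp(teacher_logits[i] - peak) for i in kept}
    total = sum(exps.values())
    return [exps.get(i, 0.0) / total for i in range(len(teacher_logits))]


def soft_cross_entropy(targets, student_logits):
    """Cross-entropy between soft targets and the student's softmax output."""
    peak = max(student_logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in student_logits))
    return -sum(p * (x - log_z) for p, x in zip(targets, student_logits))


# Hypothetical logits for a four-class output.
targets = top_k_teacher_targets([2.0, 1.0, 0.1, -1.0], k=2)
loss = soft_cross_entropy(targets, [1.5, 0.5, 0.2, -0.3])
```

Under this reading, classes outside the top k contribute nothing to the loss, so low-confidence teacher outputs cannot pull the student toward them.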
Multimodal Transformer for Unaligned Multimodal Language Sequences
Comprehensive experiments on both aligned and non-aligned multimodal time-series show that the MulT model outperforms state-of-the-art methods by a large margin, and empirical analysis suggests that correlated crossmodal signals can be captured by the proposed crossmodal attention mechanism in MulT.
Conformer: Convolution-augmented Transformer for Speech Recognition
This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms the previous Transformer- and CNN-based models, achieving state-of-the-art accuracies.