SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
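The masking side of SpecAugment is simple enough to sketch directly on a log-mel spectrogram. The mask counts and widths below are illustrative defaults, not the exact policies from the paper, and the time-warping step is omitted:

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """Apply SpecAugment-style frequency and time masking to a
    log-mel spectrogram of shape (time, freq)."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    T, F = out.shape
    for _ in range(num_freq_masks):
        f = rng.integers(0, freq_mask_width + 1)   # mask width in mel channels
        f0 = rng.integers(0, max(1, F - f + 1))    # mask start
        out[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = rng.integers(0, min(time_mask_width, T) + 1)  # width in frames
        t0 = rng.integers(0, max(1, T - t + 1))
        out[t0:t0 + t, :] = 0.0
    return out
```

Because the masks are applied to the features rather than the raw waveform, the augmentation is cheap and can run on the fly inside the input pipeline.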
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech, utilizing the unlabeled audio of the Libri-Light dataset.
Improved Noisy Student Training for Automatic Speech Recognition
TLDR
This work adapts and improves noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method and finding effective methods to filter, balance, and augment the data generated between self-training iterations.
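One round of the noisy student loop can be sketched as follows. The helpers `train`, `transcribe`, and `confidence` are hypothetical stand-ins passed as parameters, and the confidence threshold is illustrative; the actual filtering and balancing criteria are what the paper tunes:

```python
def noisy_student_round(teacher, labeled, unlabeled, augment,
                        train, transcribe, confidence, threshold=0.9):
    """One self-training iteration: pseudo-label, filter, augment, retrain.
    `labeled` is a list of (audio, transcript) pairs; `unlabeled` is a
    list of audio examples. Helper callables are assumptions, not an API.
    """
    # 1. Pseudo-label the unlabeled audio with the current teacher.
    pseudo = [(x, transcribe(teacher, x)) for x in unlabeled]
    # 2. Filter out low-confidence pseudo-labels.
    pseudo = [(x, y) for x, y in pseudo
              if confidence(teacher, x, y) >= threshold]
    # 3. Train a new student on augmented labeled + filtered pseudo data;
    #    the augmentation (e.g. SpecAugment) supplies the "noise".
    data = [(augment(x), y) for x, y in labeled + pseudo]
    return train(data)
```

The trained student then becomes the teacher for the next round, which is where the per-iteration filtering and balancing choices matter.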
Efficient Knowledge Distillation for RNN-Transducer Models
TLDR
This paper develops a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. It studies the effectiveness of the proposed approach in improving the accuracy of sparse RNN-T models obtained by gradually pruning a larger uncompressed model, which also serves as the teacher during distillation.
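The key efficiency idea can be sketched as follows: at each node (t, u) of the RNN-T output lattice, collapse the full vocabulary posterior into just three events (blank, the correct next label, everything else) and match teacher and student over this coarse distribution with a KL divergence. This is an illustrative reconstruction, not the paper's exact implementation:

```python
import numpy as np

def _collapse(p, y, blank, eps):
    """Collapse a vocabulary distribution p into (blank, correct, other)."""
    pb, py = p[blank], p[y]
    po = max(eps, 1.0 - pb - py)
    return np.array([max(eps, pb), max(eps, py), po])

def rnnt_distill_loss(teacher, student, labels, blank=0, eps=1e-8):
    """KL distillation over the collapsed 3-way distribution at every
    lattice node. `teacher`/`student` have shape (T, U, V) of per-node
    vocabulary posteriors; labels[u] is the correct next label at u."""
    T, U, V = teacher.shape
    total = 0.0
    for t in range(T):
        for u in range(U):
            pt = _collapse(teacher[t, u], labels[u], blank, eps)
            ps = _collapse(student[t, u], labels[u], blank, eps)
            total += np.sum(pt * np.log(pt / ps))
    return total / (T * U)
```

Working with three probabilities per node instead of the full vocabulary is what keeps the memory and compute cost of distilling the whole lattice manageable.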
SpecAugment on Large Scale Datasets
TLDR
This paper demonstrates the effectiveness of SpecAugment on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset, and introduces a modification of SpecAugment that adapts the time mask size and/or multiplicity to the length of the utterance, which can benefit large scale tasks.
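The adaptive idea is that fixed time-mask parameters under-augment long utterances and over-augment short ones, so both the number of time masks and their maximum width are scaled with utterance length. The ratios below are illustrative placeholders, not the paper's tuned values:

```python
def adaptive_time_mask_params(utterance_len, multiplicity_ratio=0.04,
                              max_width_ratio=0.05):
    """Scale time-mask multiplicity and maximum width with the number
    of frames in the utterance, instead of using fixed constants."""
    num_masks = max(1, int(multiplicity_ratio * utterance_len))
    max_width = max(1, int(max_width_ratio * utterance_len))
    return num_masks, max_width
```

A 1000-frame utterance then receives proportionally more and wider time masks than a 100-frame one, keeping the masked fraction roughly constant across the dataset.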
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition
TLDR
It is found that the combination of pretraining, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data.
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
TLDR
SpeechStew is a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal, and it is demonstrated that SpeechStew learns powerful transfer learning representations.
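The data recipe really is as plain as the title suggests: pool every labeled corpus and shuffle, with no per-corpus balancing, resampling, or domain labels. A minimal sketch of that mixing step (the model and training loop are assumed, not shown):

```python
import random

def speechstew_mix(datasets, rng=None):
    """Concatenate all labeled datasets and shuffle them flat.
    `datasets` maps a corpus name (e.g. "ami", "librispeech") to a
    list of (audio, transcript) examples."""
    rng = rng or random.Random()
    mixed = [ex for corpus in datasets.values() for ex in corpus]
    rng.shuffle(mixed)
    return mixed
```

Because no corpus is up- or down-weighted, larger corpora simply contribute proportionally more examples per epoch.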
Towards NNGP-guided Neural Architecture Search
TLDR
It is proposed that NNGP performance is an inexpensive signal, independent of metrics obtained from training, that can be used either to reduce large search spaces or to improve training-based performance measures.
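The "inexpensive signal" idea can be sketched for the simplest case, a fully-connected ReLU network, whose infinite-width NNGP kernel has a closed-form arccos recursion: score a candidate depth by the held-out accuracy of NNGP kernel regression, with no gradient training at all. The hyperparameters (weight/bias variances, ridge term) are illustrative, and this is a toy proxy, not the paper's search pipeline:

```python
import numpy as np

def nngp_kernel(X1, X2, depth=3, sw2=1.6, sb2=0.1):
    """Infinite-width ReLU NNGP kernel via the arccos recursion."""
    d = X1.shape[1]
    K12 = sw2 * X1 @ X2.T / d + sb2
    K11 = sw2 * np.sum(X1 * X1, 1) / d + sb2   # diagonal self-covariances
    K22 = sw2 * np.sum(X2 * X2, 1) / d + sb2
    for _ in range(depth):
        norm = np.sqrt(np.outer(K11, K22))
        theta = np.arccos(np.clip(K12 / norm, -1.0, 1.0))
        K12 = sb2 + sw2 / (2 * np.pi) * norm * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
        K11 = sb2 + sw2 / 2 * K11   # theta = 0 on the diagonal
        K22 = sb2 + sw2 / 2 * K22
    return K12

def nngp_score(Xtr, ytr, Xte, yte, depth=3, reg=1e-4):
    """Training-free architecture score: NNGP regression accuracy."""
    Ktr = nngp_kernel(Xtr, Xtr, depth)
    Kte = nngp_kernel(Xte, Xtr, depth)
    alpha = np.linalg.solve(Ktr + reg * np.eye(len(Xtr)), ytr)
    pred = Kte @ alpha
    return np.mean(np.sign(pred) == np.sign(yte))
```

Ranking many candidate architectures by such a score is far cheaper than training each one, which is what makes it usable for pruning a big search space before any training-based search.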
Universal Paralinguistic Speech Representations Using Self-Supervised Conformers
TLDR
This work introduces a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture and demonstrates that simple linear classifiers trained on top of this time-averaged representation outperform nearly all previous results.
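The downstream recipe is deliberately simple: mean-pool the encoder's frame-level features over time, then fit a linear classifier on the pooled vectors. A minimal numpy sketch, with ridge regression standing in for the paper's linear classifiers and the upstream Conformer encoder assumed rather than implemented:

```python
import numpy as np

def time_average(features):
    """Mean-pool a (time, dim) feature sequence into one vector."""
    return features.mean(axis=0)

def linear_probe(train_feats, train_labels, test_feats, reg=1e-3):
    """Fit a one-hot ridge-regression classifier on pooled features
    and return predicted class indices for the test utterances."""
    X = np.stack([time_average(f) for f in train_feats])
    Xt = np.stack([time_average(f) for f in test_feats])
    Y = np.eye(train_labels.max() + 1)[train_labels]   # one-hot targets
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return (Xt @ W).argmax(axis=1)
```

That such a shallow probe suffices is the point: the heavy lifting is done by the self-supervised representation, not the downstream model.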
The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study
TLDR
It is found that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions, and in the absence of batch normalization, the optimal normalized noise scale is directly proportional to width.
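For orientation, a common formalization of the SGD noise scale in the related literature (assuming the standard definition from Smith & Le's Bayesian analysis of SGD, which may differ in normalization from this paper's exact quantity) is

```latex
% Learning rate \epsilon, training-set size N, batch size B:
g \;\approx\; \frac{\epsilon N}{B},
\qquad
g_{\mathrm{opt}} \;\propto\; \text{width (without batch normalization)}
```

so the TLDR's finding reads as: wider networks tolerate, and indeed prefer, proportionally larger gradient noise at the optimum.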
...