• Publications
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
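Because SpecAugment operates directly on the spectrogram, its masking steps can be sketched in a few lines. The function below applies frequency and time masking to a (time, freq) log-mel spectrogram; parameter names and default widths are illustrative, not the paper's exact augmentation policies, and time warping is omitted.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """Zero out random frequency bands and time spans of `spec` (time, freq).

    A minimal sketch of SpecAugment's masking; widths are illustrative.
    """
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(num_freq_masks):                   # frequency masking
        f = rng.integers(0, freq_mask_width + 1)      # mask width
        f0 = rng.integers(0, max(1, F - f + 1))       # mask start
        spec[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):                   # time masking
        t = rng.integers(0, min(time_mask_width, T) + 1)
        t0 = rng.integers(0, max(1, T - t + 1))
        spec[t0:t0 + t, :] = 0.0
    return spec
```

Because the masks are sampled independently per example, the same utterance yields a different augmented spectrogram on every training epoch.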
Conformer: Convolution-augmented Transformer for Speech Recognition
TLDR
This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, significantly improving performance, and a multi-head attention architecture is introduced that offers improvements over the commonly used single-head attention.
Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
TLDR
This work presents the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far, and shows that MILk’s adaptive schedule allows it to arrive at latency-quality trade-offs that are favorable compared to those of a recently proposed wait-k strategy for many latency values.
Monotonic Chunkwise Attention
TLDR
Monotonic Chunkwise Attention (MoChA), which adaptively splits the input sequence into small chunks over which soft attention is computed, is proposed and shown that models utilizing MoChA can be trained efficiently with standard backpropagation while allowing online and linear-time decoding at test time.
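The chunkwise step of MoChA is easy to illustrate at test time: once the monotonic mechanism stops at frame t, soft attention is computed over a fixed window of w frames ending at t. The sketch below shows only that decoding-time step; names are illustrative, and MoChA's training-time expectation over stopping points is omitted.

```python
import numpy as np

def chunkwise_attention(values, energies, t, w=4):
    """Soft attention over the w-frame chunk ending at monotonic endpoint t.

    `values`: (T, d) encoder states; `energies`: (T,) attention energies.
    A decoding-time sketch of MoChA's chunkwise attention.
    """
    start = max(0, t - w + 1)             # chunk covers frames [start, t]
    e = energies[start:t + 1]
    a = np.exp(e - e.max())
    a /= a.sum()                          # softmax restricted to the chunk
    return a @ values[start:t + 1]        # (d,) context vector
```

Restricting the softmax to a fixed-size chunk is what gives MoChA its online, linear-time decoding: each output step touches at most w encoder frames.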
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
TLDR
This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech, utilizing the unlabeled audio of the Libri-Light dataset.
Improved Noisy Student Training for Automatic Speech Recognition
TLDR
This work adapts and improves noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method and finding effective methods to filter, balance and augment the data generated in between self-training iterations.
Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation
TLDR
It is demonstrated that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance.
Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models
TLDR
This work considers two loss functions which approximate the expected number of word errors, either by sampling from the model or by using N-best lists of decoded hypotheses; the N-best list approach is found to be more effective than the sampling-based method.
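The N-best variant of the MWER loss can be sketched directly: renormalize the model's probabilities over the N hypotheses, then take the expectation of each hypothesis's word-error count with the mean error subtracted as a baseline. Argument names below are illustrative.

```python
import numpy as np

def mwer_loss(log_probs, word_errors):
    """N-best MWER sketch: expected relative word errors under the
    renormalized model distribution over the N-best list.

    `log_probs`: model log-probabilities of the N hypotheses;
    `word_errors`: their word-error counts against the reference.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    w = np.asarray(word_errors, dtype=float)
    p = np.exp(log_probs - log_probs.max())
    p /= p.sum()                          # renormalize over the N-best list
    return float(p @ (w - w.mean()))      # baseline-subtracted expectation
```

Subtracting the mean error acts as a variance-reducing baseline: hypotheses with below-average error get pushed up in probability, those with above-average error get pushed down.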
...