Publications
Audio augmentation for speech recognition
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
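Raw-signal augmentations of this kind typically include speed and volume perturbation of the waveform. A minimal sketch of both, using linear-interpolation resampling (function names here are illustrative, not from the paper):

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample the raw waveform by a speed factor via linear interpolation.

    factor > 1 speeds the audio up (shorter output); factor < 1 slows it down.
    """
    n_out = int(round(len(signal) / factor))
    # positions in the original signal corresponding to each output sample
    positions = np.arange(n_out) * factor
    return np.interp(positions, np.arange(len(signal)), signal)

def volume_perturb(signal, gain_db):
    """Scale the waveform amplitude by a gain expressed in decibels."""
    return signal * (10.0 ** (gain_db / 20.0))

# Example: perturb a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
fast = speed_perturb(tone, 1.1)    # roughly 10% shorter
quiet = volume_perturb(tone, -6.0) # roughly half amplitude
```

In practice these perturbations are applied on the fly during training, before feature extraction, so each epoch sees a slightly different copy of the data.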
A study on data augmentation of reverberant speech for robust speech recognition
It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.
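The simulation pipeline described here amounts to convolving clean speech with a room impulse response (RIR) and then adding a point-source noise at a target SNR. A simplified sketch (function names and the toy RIR are illustrative, not from the paper):

```python
import numpy as np

def reverberate(clean, rir, noise, snr_db):
    """Simulate far-field speech: convolve with an RIR, then add noise at a
    target signal-to-noise ratio (in dB, measured against the reverberant
    signal's power)."""
    rev = np.convolve(clean, rir)[: len(clean)]
    noise = noise[: len(rev)]
    # scale the noise so that 10*log10(P_rev / P_noise_scaled) == snr_db
    p_rev = np.mean(rev ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_rev / (p_noise * 10.0 ** (snr_db / 10.0)))
    return rev + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                     # 1 s "speech" at 16 kHz
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)  # toy decaying RIR
noise = rng.standard_normal(16000)                     # point-source noise
noisy_rev = reverberate(clean, rir, noise, snr_db=10)
```

Real pipelines draw RIRs and noises from large simulated or measured databases rather than generating them synthetically as in this toy example.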
Self-Attentive Speaker Embeddings for Text-Independent Speaker Verification
The proposed self-attentive speaker embedding system is compared with a strong DNN embedding baseline on NIST SRE 2016, and it is found that the self-attentive embeddings achieve superior performance.
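The core idea is to replace uniform averaging of frame-level features with an attention-weighted mean. A single-head numpy sketch of attentive pooling (the parameter names W and v are illustrative, not the paper's notation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attentive_pool(frames, W, v):
    """Pool frame-level features (T, D) into one utterance embedding (D,).

    scores  = v . tanh(frames @ W)   -> one scalar per frame
    weights = softmax(scores)        -> non-negative, sum to 1
    output  = attention-weighted mean of the frames
    """
    scores = np.tanh(frames @ W) @ v   # (T,)
    weights = softmax(scores)          # (T,)
    return weights @ frames            # (D,)

rng = np.random.default_rng(0)
T, D, A = 50, 8, 4                     # frames, feature dim, attention dim
frames = rng.standard_normal((T, D))
W = rng.standard_normal((D, A))
v = rng.standard_normal(A)
emb = self_attentive_pool(frames, W, v)
```

In a trained system W and v are learned jointly with the rest of the network, so the pooling layer learns to emphasize the frames most indicative of speaker identity.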
JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs
This paper tackles the problem of reverberant speech recognition with 5,500 hours of simulated reverberant data and a time-delay neural network (TDNN) architecture, which is capable of modeling long-term interactions between speech and corrupting sources in reverberant environments.
An empirical exploration of CTC acoustic models
This paper presents an extensive exploration of CTC-based acoustic models applied to a variety of ASR tasks, including an empirical study of the optimal configuration and architectural variants for CTC.
An Investigation of Few-Shot Learning in Spoken Term Classification
A modification to the Model-Agnostic Meta-Learning (MAML) algorithm is proposed to treat spoken term classification as a few-shot learning problem, and the approach is shown to outperform both conventional supervised learning and the original MAML.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Extensive evaluations on a wide variety of spoken language processing tasks, including voice conversion, automatic speech recognition, text to speech, and speaker identification, show the superiority of the proposed SpeechT5 framework.
Auto-KWS 2021 Challenge: Task, Datasets, and Baselines
The Auto-KWS 2021 challenge calls for automated machine learning (AutoML) solutions to automate the process of applying machine learning to a customized keyword spotting task.
Mixup Learning Strategies for Text-Independent Speaker Verification
This paper investigates the mixup learning strategy for training a speaker-discriminative deep neural network (DNN) for better text-independent speaker verification and finds that mixup training improves the DNN's speaker classification accuracy consistently without requiring any additional data sources.
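Mixup itself is a simple recipe: draw a mixing coefficient from a Beta distribution and form convex combinations of two training examples and their (one-hot) labels. A minimal sketch, with the two-speaker toy labels being purely illustrative:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two training examples and their one-hot labels.

    lam ~ Beta(alpha, alpha); the mixed pair is
    (lam*x1 + (1-lam)*x2, lam*y1 + (1-lam)*y2).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam

rng = np.random.default_rng(1)
xa = rng.standard_normal(40)    # e.g. one frame of acoustic features
xb = rng.standard_normal(40)
ya = np.array([1.0, 0.0])       # one-hot speaker labels (2-speaker toy case)
yb = np.array([0.0, 1.0])
xm, ym, lam = mixup(xa, ya, xb, yb, alpha=0.2, rng=rng)
```

Because the mixed labels are soft, the network is trained with a cross-entropy loss against these interpolated targets rather than hard class indices.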