
Vocal Tract Length Perturbation (VTLP) improves speech recognition

Navdeep Jaitly and Geoffrey E. Hinton

Augmenting datasets by transforming inputs in a way that does not change the label is a crucial ingredient of state-of-the-art methods for object recognition using neural networks. This paper applies the same principle to speech recognition by randomly perturbing the vocal tract length of each training utterance. In practice this can be achieved with the warping techniques used for vocal tract length normalization (VTLN), with the difference that a warp factor is generated randomly each time during training, rather than fitting a single warp factor to each training and test speaker (or utterance).
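The key idea — reusing VTLN-style frequency warping, but drawing a fresh warp factor at random for every training utterance — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the 4800 Hz boundary frequency are assumptions, the uniform sampling range is hedged, and the piecewise-linear form (linear scaling below a boundary, with the band above remapped so the Nyquist frequency is preserved) is one common way such warps are defined.

```python
import numpy as np

def vtlp_warp_freqs(freqs, alpha, f_hi=4800.0, sample_rate=16000.0):
    """Piecewise-linear VTLP warp applied to filterbank center frequencies.

    freqs       : frequencies in Hz (e.g., mel filterbank centers).
    alpha       : warp factor, drawn randomly per utterance during training.
    f_hi        : boundary frequency below which the warp is purely linear
                  (4800 Hz is an illustrative choice).
    sample_rate : audio sampling rate in Hz.
    """
    nyquist = sample_rate / 2.0
    # Below the boundary, scale linearly by alpha; above it, remap the
    # remaining band linearly so the Nyquist frequency stays fixed.
    boundary = f_hi * min(alpha, 1.0) / alpha
    freqs = np.asarray(freqs, dtype=float)
    upper = nyquist - (nyquist - f_hi * min(alpha, 1.0)) \
        / (nyquist - boundary) * (nyquist - freqs)
    return np.where(freqs <= boundary, freqs * alpha, upper)

# Sample a fresh warp factor for every utterance seen during training.
rng = np.random.default_rng()
alpha = rng.uniform(0.9, 1.1)  # assumed sampling range for illustration
warped_centers = vtlp_warp_freqs(np.linspace(0.0, 8000.0, 27), alpha)
```

The warped frequencies would then be used to build that utterance's mel filterbank, so each training pass sees a slightly different effective vocal tract length while the transcript label is unchanged.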

Citations

Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System
An improved vocal tract length perturbation (VTLP) algorithm is proposed as a data augmentation technique for an attention-based end-to-end speech recognition system, evaluated both with shallow fusion of a Transformer language model and without any language model (LM).
Neural VTLN for Speaker Adaptation in TTS
Experimental results show that the DNN is capable of predicting phone-dependent warpings on artificial data, and that such warpings improve the quality of an acoustic model on real data in subjective listening tests.
Vocal Tract Length Perturbation for Text-Dependent Speaker Verification With Autoregressive Prediction Coding
This letter explores bottleneck (BN) features extracted by training deep neural networks with a self-supervised learning objective, autoregressive predictive coding (APC), for text-dependent speaker verification (TD-SV), and applies the proposed VTL method to APC and speaker-discriminant BN features.
Speech Augmentation Using Wavenet in Speech Recognition
  • Jisung Wang, Sangki Kim, Yeha Lee
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
This work proposes a voice conversion approach using a generative model (WaveNet) that generates a new utterance by transforming an utterance to a given target voice, and shows that this method leads to better generalization compared to other data augmentation techniques such as speed perturbation and WORLD-based voice conversion.
A Comparison of Streaming Models and Data Augmentation Methods for Robust Speech Recognition
A comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T).
Discriminatively trained joint speaker and environment representations for adaptation of deep neural network acoustic models
This paper proposes a novel approach for estimating a compact joint representation of speakers and environment by training a DNN, with a bottleneck layer, to classify the i-vector features into speaker and environment labels by Multi-Task Learning (MTL).
In this paper, we report some recent improvements to DNN/HMM hybrid acoustic modeling for the EML real-time large vocabulary speech recognition system, including the introduction of speaker adaptive …
Data Independent Sequence Augmentation Method for Acoustic Scene Classification
This paper investigates a novel sequence augmentation method for long short-term memory (LSTM) acoustic modeling to deal with data sparsity in acoustic scene classification tasks, and shows performance improvements from the proposed method.
Modulation spectrum augmentation for robust speech recognition
The main contribution of this paper is to warp the intermediate representation of the cepstral feature vector sequence of an utterance in a holistic manner, developing a two-stage augmentation approach that successively conducts perturbation in the waveform domain and warping in different modulation domains of cepstral speech feature vector sequences to further enhance robustness.

References

A frequency warping approach to speaker normalization
An efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis are presented.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Acoustic Modeling Using Deep Belief Networks
It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
The Kaldi Speech Recognition Toolkit
The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Gradient-based learning applied to document recognition
This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.
Best practices for convolutional neural networks applied to visual document analysis
A set of concrete best practices that document analysis researchers can use to get good results with neural networks, including a simple "do-it-yourself" implementation of convolution with a flexible architecture suitable for many visual document problems.
High-Performance Neural Networks for Visual Object Classification
We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way.
A Fast Learning Algorithm for Deep Belief Nets
A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Rectified Linear Units Improve Restricted Boltzmann Machines
Replacing the binary stochastic hidden units of restricted Boltzmann machines with rectified linear units is shown to learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.