Vocal Tract Length Perturbation (VTLP) improves speech recognition
@inproceedings{Jaitly2013VocalTL, title={Vocal Tract Length Perturbation (VTLP) improves speech recognition}, author={Navdeep Jaitly and Geoffrey E. Hinton}, year={2013} }
Augmenting datasets by transforming inputs in a way that does not change the label is a crucial ingredient of state-of-the-art methods for object recognition using neural networks. Key Method: In practice this can be achieved by using warping techniques that are used for vocal tract length normalization (VTLN), with the difference that a warp factor is generated randomly each time during training, rather than fitting a single warp factor to each training and test speaker (or utterance).
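The warp described above is typically a piecewise-linear remapping of the frequency axis. The following is a minimal sketch of that idea in Python/NumPy: a fresh random warp factor is drawn for each training utterance and used to warp the centre frequencies of a mel filterbank before features are computed. Function names, the warp-factor range, and the boundary frequency are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sample_warp_factor(rng, low=0.9, high=1.1):
    """Draw a fresh random warp factor for each training utterance,
    instead of fitting a single factor per speaker as in classical VTLN."""
    return rng.uniform(low, high)

def vtlp_warp_freqs(freqs, alpha, f_hi=4800.0, sample_rate=16000.0):
    """Piecewise-linear frequency warp (illustrative VTLN-style mapping).

    Frequencies below a boundary are scaled by alpha; above it the mapping
    is linear so that the Nyquist frequency maps to itself, keeping the
    warped axis inside [0, sample_rate / 2].
    """
    nyquist = sample_rate / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    warped_low = freqs * alpha
    warped_high = nyquist - (nyquist - f_hi * min(alpha, 1.0)) \
        / (nyquist - boundary) * (nyquist - freqs)
    return np.where(freqs <= boundary, warped_low, warped_high)

# Example: warp the centre frequencies of a toy 40-filter mel bank
# for one training utterance.
rng = np.random.default_rng(0)
alpha = sample_warp_factor(rng)
center_freqs = np.linspace(0.0, 8000.0, 40)
warped_centers = vtlp_warp_freqs(center_freqs, alpha)
```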
283 Citations
Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System
- Computer Science, INTERSPEECH
- 2019
An improved vocal tract length perturbation (VTLP) algorithm is presented as a data augmentation technique for an attention-based end-to-end speech recognition system, evaluated both without any language models (LMs) and with shallow fusion of a Transformer LM.
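As a side note on the decoding setup mentioned above, shallow fusion in its common formulation simply interpolates the end-to-end model's token log-probabilities with those of an external language model during beam search. A minimal sketch, with hypothetical hypothesis records and an assumed interpolation weight (not values from the cited paper):

```python
def rescore_beam(hypotheses, lm_weight=0.3):
    """Shallow fusion at one decoding step (illustrative sketch).

    Each hypothesis carries the ASR model's log-probability and the
    external LM's log-probability for the same token sequence; the
    combined score adds them with a tuned interpolation weight.
    """
    return sorted(
        hypotheses,
        key=lambda h: h["asr_logp"] + lm_weight * h["lm_logp"],
        reverse=True,
    )

# Toy usage with hypothetical candidate hypotheses.
beam = [
    {"tokens": ["the", "cat"], "asr_logp": -1.2, "lm_logp": -0.8},
    {"tokens": ["the", "cap"], "asr_logp": -1.1, "lm_logp": -2.5},
]
best = rescore_beam(beam)[0]
```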
Neural VTLN for Speaker Adaptation in TTS
- Computer Science, 10th ISCA Workshop on Speech Synthesis (SSW 10)
- 2019
Experimental results show that the DNN is capable of predicting phonedependent warpings on artificial data, and that such warpings improve the quality of an acoustic model on real data in subjective listening tests.
Speech Augmentation Using Wavenet in Speech Recognition
- Computer Science, 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This work proposes a voice conversion approach using a generative model (WaveNet), which generates a new utterance by transforming an utterance to a given target voice, and shows that this method leads to better generalization than other data augmentation techniques such as speed perturbation and WORLD-based voice conversion.
A Comparison of Streaming Models and Data Augmentation Methods for Robust Speech Recognition
- Computer Science, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
A comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T).
Discriminatively trained joint speaker and environment representations for adaptation of deep neural network acoustic models
- Computer Science, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
This paper proposes a novel approach for estimating a compact joint representation of speakers and environment by training a DNN, with a bottleneck layer, to classify the i-vector features into speaker and environment labels by Multi-Task Learning (MTL).
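The joint-representation idea summarized above can be sketched as a shared network with a bottleneck layer feeding two classification heads. The sketch below is an assumption-laden illustration: layer sizes, label counts, and the use of PyTorch are placeholders, not details from the paper.

```python
import torch
import torch.nn as nn

class BottleneckMTL(nn.Module):
    """Illustrative sketch: a DNN with a low-dimensional bottleneck trained
    to predict both speaker and environment labels from i-vectors, so the
    bottleneck activations form a compact joint representation."""

    def __init__(self, ivector_dim=400, bottleneck_dim=64,
                 n_speakers=300, n_environments=10):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(ivector_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim), nn.ReLU(),  # bottleneck layer
        )
        self.speaker_head = nn.Linear(bottleneck_dim, n_speakers)
        self.environment_head = nn.Linear(bottleneck_dim, n_environments)

    def forward(self, ivectors):
        z = self.shared(ivectors)  # compact joint representation
        return self.speaker_head(z), self.environment_head(z)

# Multi-task loss: sum of the two classification losses.
model = BottleneckMTL()
x = torch.randn(8, 400)
spk_logits, env_logits = model(x)
spk_targets = torch.randint(0, 300, (8,))
env_targets = torch.randint(0, 10, (8,))
loss = (nn.functional.cross_entropy(spk_logits, spk_targets)
        + nn.functional.cross_entropy(env_logits, env_targets))
```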
Recent Improvements to Neural Network Based Acoustic Modeling in the EML Real-Time Transcription Platform
- Computer Science
- 2019
In this paper, we report some recent improvements to DNN/HMM hybrid acoustic modeling for the EML real-time large vocabulary speech recognition system, including the introduction of speaker adaptive…
Data Independent Sequence Augmentation Method for Acoustic Scene Classification
- Computer Science, INTERSPEECH
- 2018
This paper investigates a novel sequence augmentation method for long short-term memory (LSTM) acoustic modeling to deal with data sparsity in acoustic scene classification tasks and shows performance improvements of the proposed methods.
Modulation spectrum augmentation for robust speech recognition
- Computer Science, AISS '19
- 2019
The main contribution of this paper is to warp the intermediate representation of the cepstral feature vector sequence of an utterance in a holistic manner, and to develop a two-stage augmentation approach that successively conducts perturbation in the waveform domain and warping in different modulation domains of cepstral speech feature vector sequences to further enhance robustness.
Improving speech recognition using data augmentation and acoustic model fusion
- Computer Science, KES
- 2017
Investigating a neural all pass warp in modern TTS applications
- Computer Science, Speech Communication
- 2022
References
A frequency warping approach to speaker normalization
- Engineering, IEEE Trans. Speech Audio Process.
- 1998
An efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis are presented.
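For illustration, the "modify the filterbank" mechanism can be sketched as dividing each mel filter's centre frequency by a per-speaker warp factor, with the factor chosen from a small grid search; the direction of the warp, the grid values, and the filterbank sizes below are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_mel_centers(n_filters=23, f_min=0.0, f_max=8000.0, alpha=1.0):
    """Speaker-normalized filterbank (illustrative): instead of warping the
    signal, each filter's centre frequency is divided by the speaker's warp
    factor alpha, so alpha > 1 shifts the filters toward lower frequencies."""
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    centers_hz = mel_to_hz(mel_points)
    return np.clip(centers_hz / alpha, 0.0, f_max)

# The warp factor itself is typically chosen per speaker by searching a
# small grid for the value that maximizes the likelihood of the warped
# features under the acoustic model.
candidate_alphas = np.arange(0.88, 1.13, 0.02)
filterbanks = {a: warped_mel_centers(alpha=a) for a in candidate_alphas}
```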
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
- Computer Science, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2012
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
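A minimal sketch of local filtering and max-pooling along the frequency axis of filterbank features, with hypothetical filter sizes and input dimensions rather than the paper's configuration, using PyTorch:

```python
import torch
import torch.nn as nn

# Convolution and pooling act only along the frequency axis, so small
# spectral shifts between speakers are absorbed by the max-pooling.
freq_cnn = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=(1, 8)),  # local filtering over frequency
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 3)),  # max-pooling over frequency
)

# Input layout: (batch, channel, time frames, mel-frequency bins).
features = torch.randn(4, 1, 11, 40)
out = freq_cnn(features)  # frequency axis reduced: 40 -> 33 -> 11
```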
Acoustic Modeling Using Deep Belief Networks
- Computer Science, IEEE Transactions on Audio, Speech, and Language Processing
- 2012
It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
The Kaldi Speech Recognition Toolkit
- Computer Science
- 2011
The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
Gradient-based learning applied to document recognition
- Computer Science, Proc. IEEE
- 1998
This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task; convolutional neural networks are shown to outperform all other techniques.
Best practices for convolutional neural networks applied to visual document analysis
- Computer Science, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.
- 2003
A set of concrete best practices is presented that document analysis researchers can use to get good results with neural networks, including a simple "do-it-yourself" implementation of convolution with a flexible architecture suitable for many visual document problems.
High-Performance Neural Networks for Visual Object Classification
- Computer Science, ArXiv
- 2011
We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a…
A Fast Learning Algorithm for Deep Belief Nets
- Computer Science, Neural Computation
- 2006
A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Rectified Linear Units Improve Restricted Boltzmann Machines
- Computer Science, ICML
- 2010
Restricted Boltzmann machines were originally developed using binary stochastic hidden units; replacing these with noisy rectified linear units learns features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset.