• Corpus ID: 7360763

Audio augmentation for speech recognition

@inproceedings{Ko2015AudioAF,
  title={Audio augmentation for speech recognition},
  author={Tom Ko and Vijayaditya Peddinti and Daniel Povey and Sanjeev Khudanpur},
  booktitle={INTERSPEECH},
  year={2015}
}
Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods that directly process the raw signal; the method we mainly study is speed perturbation, which resamples the audio to produce a time-warped signal. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
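To make the key method concrete, here is a minimal sketch of speed perturbation on a raw waveform. The paper implements it with the Sox speed function; the snippet below approximates the same resampling with linear interpolation in NumPy, and the function names are illustrative, not from the paper's recipe.

```python
import numpy as np

def speed_perturb(signal, factor):
    """Resample so the output plays as x(factor * t): factor > 1 gives a
    shorter, faster, higher-pitched signal; factor < 1 the opposite."""
    n_out = int(round(len(signal) / factor))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(new_idx, old_idx, signal)

# The paper's recipe adds two warped copies of each utterance, at
# factors 0.9 and 1.1, to the original data (factor 1.0), giving a
# 3-fold augmented training set.
signal = np.random.randn(16000)                # placeholder: 1 s at 16 kHz
augmented = [speed_perturb(signal, f) for f in (0.9, 1.0, 1.1)]
```

Because each warped copy is treated as a new utterance, the 0.9/1.0/1.1 setup triples the nominal amount of training data at negligible cost.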

Citations

Two-Stage Data Augmentation for Low-Resourced Speech Recognition
TLDR
An analysis is presented exploring why multiple, complementary augmentation approaches to increasing the amount of training data are beneficial on low-resourced languages from the IARPA Babel program.
Speech Augmentation Using Wavenet in Speech Recognition
  • Jisung Wang, Sangki Kim, Yeha Lee
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This work proposes a voice conversion approach using a generative model (WaveNet), which generates a new utterance by transforming an utterance to a given target voice, and shows that this method leads to better generalization than other data augmentation techniques such as speed perturbation and WORLD-based voice conversion.
ImportantAug: a data augmentation agent for speech
TLDR
The proposed ImportantAug outperforms the conventional noise augmentation and the baseline on two test sets with additional noise added, and also provides a 25.4% error rate reduction compared to a baseline without data augmentation.
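For context, the conventional noise-augmentation baseline that ImportantAug is compared against can be sketched as mixing a noise recording into the speech at a chosen signal-to-noise ratio. A minimal, hypothetical NumPy implementation (not code from the paper):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech`, scaled so the result has the given
    signal-to-noise ratio in dB."""
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]      # tile/trim to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # choose `scale` so 10*log10(p_speech / (scale**2 * p_noise)) == snr_db
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```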
Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation
TLDR
This work proposes to expand the training set at the data level by re-encoding audio with different codecs, varying the bit rate, sampling rate, and bit depth, which ensures variation in the input data without drastically affecting the audio quality.
Modulation spectrum augmentation for robust speech recognition
TLDR
The main contribution of this paper is to warp the intermediate representation of the cepstral feature vector sequence of an utterance in a holistic manner, and to develop a two-stage augmentation approach that successively conducts perturbation in the waveform domain and warping in different modulation domains of cepstral speech feature vector sequences, to further enhance robustness.
A Survey of the Effects of Data Augmentation for Automatic Speech Recognition Systems
TLDR
This paper presents a survey of data augmentation techniques and their effects on Automatic Speech Recognition systems; some experiments were carried out to support the hypothesis that adding noise does not always help.
A study on data augmentation of reverberant speech for robust speech recognition
TLDR
It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.
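The augmentation studied in that entry can be sketched as convolving dry speech with a room impulse response (RIR) and mixing in a point-source noise rendered through its own RIR. A simplified NumPy/SciPy sketch with illustrative names; a real recipe would randomize the RIR selection and the SNR:

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir, noise=None, noise_rir=None, snr_db=15.0):
    """Convolve dry speech with a room impulse response, optionally
    adding a point-source noise rendered through its own RIR at the
    requested SNR."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    if noise is None:
        return wet
    if noise_rir is not None:
        noise = fftconvolve(noise, noise_rir)[: len(noise)]
    # tile/trim the noise to match, then scale it to the target SNR
    reps = int(np.ceil(len(wet) / len(noise)))
    noise = np.tile(noise, reps)[: len(wet)]
    p_wet, p_noise = np.mean(wet ** 2), np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_wet / (p_noise * 10 ** (snr_db / 10.0)))
    return wet + scale * noise
```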
Data Augmentation for Training of Noise Robust Acoustic Models
TLDR
This paper compares acoustic models trained on speech corpora with artificially added noises of different origins and reverberation, and finds that the word recognition accuracy improvement over the baseline model trained on clean headset recordings reaches 45%.
SpecMix: A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features
TLDR
A mixed sample data augmentation strategy is proposed to enhance the performance of models on audio scene classification, sound event classification, and speech enhancement tasks by applying time-frequency masks that are effective in preserving the spectral correlation of each audio sample.
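As a rough illustration of the mixed-sample, time-frequency-mask idea in the SpecMix entry above (not the paper's exact scheme), the sketch below copies a random block of frequency bands and a random block of time frames from one spectrogram into another, returning a label-mixing weight proportional to how many bins were exchanged. All names and mask sizes are assumptions, and the spectrograms are assumed larger than the maximum mask sizes:

```python
import numpy as np

def specmix(spec_a, spec_b, max_bands=20, max_frames=40, rng=None):
    """Copy a random block of frequency bands and a random block of
    time frames from spec_b into spec_a (both shaped (freq, time));
    returns the mixed spectrogram and the label weight for spec_a."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec_a.copy()
    fw = int(rng.integers(1, max_bands + 1))          # band-mask width
    f0 = int(rng.integers(0, spec_a.shape[0] - fw))   # band-mask start
    tw = int(rng.integers(1, max_frames + 1))         # frame-mask width
    t0 = int(rng.integers(0, spec_a.shape[1] - tw))   # frame-mask start
    out[f0:f0 + fw, :] = spec_b[f0:f0 + fw, :]
    out[:, t0:t0 + tw] = spec_b[:, t0:t0 + tw]
    # fraction of bins still from spec_a, used to mix the two labels
    from_b = fw * spec_a.shape[1] + tw * spec_a.shape[0] - fw * tw
    lam = 1.0 - from_b / spec_a.size
    return out, lam
```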
Improving Sequence-To-Sequence Speech Recognition Training with On-The-Fly Data Augmentation
TLDR
This paper examines the influence of three data augmentation methods on the performance of two S2S model architectures: time perturbation in the frequency domain, sub-sequence sampling, and a method of the authors' own development.
...
...

References

SHOWING 1-10 OF 17 REFERENCES
Data augmentation for low resource languages
TLDR
Two data augmentation schemes, semi-supervised training and vocal tract length perturbation, are examined and combined on the Babel limited language pack configuration, showing that consistent speech recognition performance gains can be obtained.
Data Augmentation for Deep Neural Network Acoustic Modeling
TLDR
Two data augmentation approaches, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM) for deep neural network acoustic modeling based on label-preserving transformations to deal with data sparsity are investigated.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
TLDR
Improvements in speech recognition are obtained without increasing the number of training epochs, suggesting that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
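VTLP applies a random, utterance-level warp to the frequency axis of each spectrum. Below is a simplified sketch of the usual piecewise-linear warp (frequencies below a boundary are scaled by a factor alpha, then mapped linearly so the Nyquist bin is fixed), applied to a magnitude spectrogram; parameter names are illustrative:

```python
import numpy as np

def vtlp_warp(spec, alpha, f_hi=0.8):
    """Piecewise-linear frequency warp of a (freq, time) spectrogram:
    bins below a boundary are scaled by alpha, bins above are mapped
    linearly so the highest bin stays fixed."""
    n = spec.shape[0]
    nyq = n - 1                        # index of the highest-frequency bin
    boundary = f_hi * nyq * min(alpha, 1.0) / alpha
    freqs = np.arange(n, dtype=float)
    warped = np.where(
        freqs <= boundary,
        freqs * alpha,
        nyq - (nyq - f_hi * nyq * min(alpha, 1.0)) * (nyq - freqs) / (nyq - boundary),
    )
    out = np.empty_like(spec)
    for t in range(spec.shape[1]):
        # resample each frame so output bin f takes the value at w^-1(f)
        out[:, t] = np.interp(freqs, warped, spec[:, t])
    return out
```

In the usual setup, alpha is drawn once per utterance from roughly [0.9, 1.1], so every copy of the data sees a slightly different vocal tract length.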
Deep Speech: Scaling up end-to-end speech recognition
TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Elastic spectral distortion for low resource speech recognition with deep neural networks
TLDR
An elastic spectral distortion method to artificially augment training samples to help DNN-HMMs acquire enough robustness even when there are a limited number of training samples is investigated.
Improving deep neural network acoustic models using generalized maxout networks
TLDR
This paper introduces two new types of generalized maxout units, called p-norm and soft-maxout, and presents a method to control instability during the training of unbounded-output nonlinearities.
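The p-norm unit itself is compact enough to state directly: each group of activations is collapsed to the p-norm of the group, y = (sum_i |x_i|^p)^(1/p). A minimal NumPy sketch (group size is illustrative; the paper reports p = 2 working well, and the last dimension is assumed divisible by the group size):

```python
import numpy as np

def pnorm_units(x, group_size=10, p=2.0):
    """Generalized maxout: each non-overlapping group of `group_size`
    activations is reduced to its p-norm, y = (sum |x_i|^p)^(1/p)."""
    groups = x.reshape(x.shape[:-1] + (-1, group_size))
    return np.sum(np.abs(groups) ** p, axis=-1) ** (1.0 / p)

# e.g. a (batch, 3000) pre-activation becomes a (batch, 300) output
y = pnorm_units(np.random.randn(32, 3000), group_size=10)
```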
An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech
  • W. Verhelst, M. Roelands
  • Computer Science
    1993 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1993
TLDR
The resulting WSOLA (waveform-similarity-based synchronized overlap-add) algorithm produces high-quality speech output, is algorithmically and computationally efficient and robust, and allows for online processing with arbitrary time-scaling factors.
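A heavily simplified sketch of the WSOLA idea follows: write overlapping windowed segments at a fixed synthesis hop, but pick each segment from within a small tolerance around its nominal analysis position so that it best correlates with the natural continuation of the output so far. This illustrates the principle only, not Verhelst and Roelands' exact algorithm; window, tolerance, and search grid are assumptions:

```python
import numpy as np

def wsola(x, rate, win=1024, tol=512):
    """Simplified WSOLA: stretch x in time by 1/rate without changing
    pitch (rate > 1 gives a shorter, faster output). Input must be
    longer than win + 2*tol + win//2 samples."""
    hop_out = win // 2                           # synthesis hop, 50% overlap
    hop_in = max(1, int(round(hop_out * rate)))  # analysis hop
    window = np.hanning(win)
    n_frames = (len(x) - win - 2 * tol - hop_out) // hop_in
    y = np.zeros((n_frames - 1) * hop_out + win)
    norm = np.zeros_like(y)
    target = x[:win]  # the "natural continuation" we want to match
    for i in range(n_frames):
        center = i * hop_in + tol
        # search +/- tol samples (coarse grid) for the input segment
        # that best correlates with the desired continuation
        best, best_score = 0, -np.inf
        for d in range(-tol, tol + 1, 32):
            score = np.dot(x[center + d : center + d + win], target)
            if score > best_score:
                best, best_score = d, score
        seg = x[center + best : center + best + win]
        y[i * hop_out : i * hop_out + win] += window * seg
        norm[i * hop_out : i * hop_out + win] += window
        # the next segment should sound like what follows this one
        target = x[center + best + hop_out : center + best + hop_out + win]
    return y / np.maximum(norm, 1e-8)
```

Unlike speed perturbation, this kind of tempo modification changes duration while leaving the pitch contour intact, which is why the two are distinct augmentation strategies.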
A pitch extraction algorithm tuned for automatic speech recognition
TLDR
An algorithm that produces pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems; these features give large performance improvements on tonal languages, and even substantial improvements for non-tonal languages.
Parallel training of Deep Neural Networks with Natural Gradient and Parameter Averaging
TLDR
Another method is described: an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which allows the periodic parameter-averaging method to work well and also substantially improves the convergence of SGD on a single machine.
Support vector machines for noise robust ASR
TLDR
Tree-based reduction approaches for multiclass classification are described, as well as some of the issues in applying them to dynamic data such as speech.
...
...