Small Energy Masking for Improved Neural Network Training for End-To-End Speech Recognition
@article{Kim2020SmallEM,
  title={Small Energy Masking for Improved Neural Network Training for End-To-End Speech Recognition},
  author={Chanwoo Kim and Kwangyoun Kim and Sathish Reddy Indurthi},
  journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={7684-7688}
}
In this paper, we present a Small Energy Masking (SEM) algorithm, which masks input features whose values fall below a certain threshold. More specifically, a time-frequency bin is masked if its filterbank energy is less than a certain energy threshold. The ratio of this energy threshold to the peak filterbank energy of each utterance, in decibels, is randomly drawn from a uniform distribution. The unmasked feature elements are scaled so that the total sum of the feature values remain…
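A minimal NumPy sketch of the SEM idea described in the abstract is shown below. The dB range of the uniform distribution and the choice to operate on linear filterbank energies are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def small_energy_masking(fbank, min_db=-80.0, max_db=0.0, rng=None):
    """Sketch of Small Energy Masking (SEM) on linear filterbank energies.

    fbank: (time, channels) array of non-negative filterbank energies.
    min_db/max_db: assumed bounds for the threshold-to-peak ratio in dB.
    """
    rng = np.random.default_rng() if rng is None else rng
    peak = fbank.max()
    # Threshold-to-peak ratio in dB, drawn from a uniform distribution.
    ratio_db = rng.uniform(min_db, max_db)
    threshold = peak * 10.0 ** (ratio_db / 10.0)
    # Mask time-frequency bins whose energy falls below the threshold.
    masked = np.where(fbank >= threshold, fbank, 0.0)
    # Rescale surviving bins so the total feature sum is unchanged.
    kept = masked.sum()
    if kept > 0:
        masked *= fbank.sum() / kept
    return masked

# Example: apply SEM to fake (time, mel) filterbank energies.
features = np.abs(np.random.randn(100, 40)) ** 2
augmented = small_energy_masking(features)
```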
7 Citations
Macro-Block Dropout for Improved Regularization in Training End-to-End Speech Recognition Models
- Computer Science2022 IEEE Spoken Language Technology Workshop (SLT)
- 2023
This work defines a macro-block that contains a large number of units from the input to a Recurrent Neural Network (RNN) and applies random dropout to each macro-block, which has the effect of applying a different dropout rate to each layer even when a constant average dropout rate is kept.
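The cited work's exact block layout and rates are not given here; the following is a hypothetical NumPy sketch of the macro-block-dropout idea summarized above, with the block count and drop probability chosen purely for illustration.

```python
import numpy as np

def macro_block_dropout(x, num_blocks=4, drop_prob=0.2, rng=None):
    """Zero out whole macro-blocks of features and rescale the survivors.

    x: (batch, features) input to an RNN layer. Features are split into
    num_blocks contiguous macro-blocks; each block is dropped as a whole.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, feat = x.shape
    block = int(np.ceil(feat / num_blocks))
    # One keep/drop decision per macro-block, per example.
    keep = rng.random((batch, num_blocks)) >= drop_prob
    mask = np.repeat(keep, block, axis=1)[:, :feat].astype(x.dtype)
    return x * mask / (1.0 - drop_prob)
```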
Auditory-Based Data Augmentation for end-to-end Automatic Speech Recognition
- Computer ScienceICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
The results show that the proposed augmentation methods bring statistically significant improvements over the performance of the state-of-the-art SpecAugment.
Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition
- Computer ScienceINTERSPEECH
- 2020
The proposed utterance-invariant training combines three different types of conditioning, namely concatenative, multiplicative, and additive, and shows reductions in word error rate of up to 7% relative on LibriSpeech and 10-15% on a large-scale Korean end-to-end two-pass hybrid ASR model.
On the limit of English conversational speech recognition
- Computer ScienceInterspeech
- 2021
The Conformer shows similar performance to the LSTM; nevertheless, their combination and decoding with an improved LM reach a new record on Switchboard-300, and a new state of the art is reported, practically reaching the limit of the benchmark.
Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing
- Computer ScienceINTERSPEECH
- 2020
This paper presents a streaming on-device end-to-end speech recognition solution for a privacy-sensitive voice-typing application, which primarily involves typing users' private details and passwords, and explores domain biasing using shallow fusion with a weighted finite-state transducer.
Towards Explainable Classifiers Using the Counterfactual Approach - Global Explanations for Discovering Bias in Data
- Computer ScienceJ. Artif. Intell. Soft Comput. Res.
- 2021
Using the proposed method, a number of possible bias-causing artifacts are identified in dermoscopy images, and it is confirmed that black frames have a strong influence on the Convolutional Neural Network's prediction: 22% of them changed the prediction from benign to malignant.
IAES International Journal of Artificial Intelligence (IJ-AI)
- Computer Science
- 2021
A modified MCA based on harmony search (HS), which has been successfully applied to various optimization problems, is presented; it provides more exploitation and intensification and gives the proposed method superiority over the original algorithm.
References
SHOWING 1-10 OF 23 REFERENCES
Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition
- Computer ScienceIEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2016
Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing.
Robust speech recognition using a Small Power Boosting algorithm
- Computer Science2009 IEEE Workshop on Automatic Speech Recognition & Understanding
- 2009
The experimental results indicate that this simple idea of intentionally boosting the power of time-frequency bins with small energy for both the training and testing datasets is very helpful for very difficult noisy environments such as corruption by background music.
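A rough NumPy sketch of the small-power-boosting idea summarized in this reference follows; flooring bins at a fixed dB ratio to the utterance peak is an assumption of this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def small_power_boosting(power, floor_db=-40.0):
    """Raise small-energy time-frequency bins toward a floor (assumed form).

    power: (time, channels) array of non-negative spectral power values.
    floor_db: assumed floor, expressed in dB relative to the utterance peak.
    """
    floor = power.max() * 10.0 ** (floor_db / 10.0)
    # Bins below the floor are boosted up to it; larger bins are unchanged.
    return np.maximum(power, floor)
```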
Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System
- Computer ScienceINTERSPEECH
- 2019
An improved vocal tract length perturbation (VTLP) algorithm is presented as a data augmentation technique and evaluated with an attention-based end-to-end speech recognition system, both with shallow fusion of a Transformer LM and without using any Language Models (LMs).
Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus
- Computer Science2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained around…
Power-Law Nonlinearity with Maximally Uniform Distribution Criterion for Improved Neural Network Training in Automatic Speech Recognition
- Computer Science2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
A new algorithm for designing the compressive nonlinearity in a data-driven way is developed, which is much more flexible than the previous approaches and may be extended to other domains as well.
Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction
- EngineeringINTERSPEECH
- 2009
Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
- Computer Science2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional…
End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System
- Computer Science2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
The authors' end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on test-clean of the LibriSpeech test set after applying shallow fusion with a Transformer language model (LM).
Improved training of end-to-end attention models for speech recognition
- Computer ScienceINTERSPEECH
- 2018
This work introduces a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.
Attention-Based Models for Speech Recognition
- Computer ScienceNIPS
- 2015
The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.