Small Energy Masking for Improved Neural Network Training for End-To-End Speech Recognition

  • Chanwoo Kim, Kwangyoun Kim, Sathish Reddy Indurthi
  • Published 15 February 2020
  • Computer Science, Physics
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper, we present a Small Energy Masking (SEM) algorithm, which masks inputs having values below a certain threshold. More specifically, a time-frequency bin is masked if the filterbank energy in this bin is less than a certain energy threshold. A uniform distribution is employed to randomly generate the ratio of this energy threshold to the peak filterbank energy of each utterance in decibels. The unmasked feature elements are scaled so that the total sum of the feature values remain… 
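The masking described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: it assumes the input is a (time, frequency) array of non-negative linear filterbank energies, and the function name, the per-utterance ratio range (here -80 dB to 0 dB), and the `rng` parameter are assumptions for the example.

```python
import numpy as np

def small_energy_masking(fbank, ratio_db_range=(-80.0, 0.0), rng=None):
    # fbank: (time, freq) array of non-negative linear filterbank energies.
    rng = rng or np.random.default_rng()
    # Draw the threshold-to-peak ratio (in dB) uniformly, once per utterance.
    ratio_db = rng.uniform(*ratio_db_range)
    # Absolute energy threshold relative to the utterance's peak energy.
    threshold = fbank.max() * 10.0 ** (ratio_db / 10.0)
    # Zero out every time-frequency bin below the threshold.
    masked = fbank * (fbank >= threshold)
    # Rescale the surviving bins so the total feature sum is preserved.
    kept = masked.sum()
    if kept > 0.0:
        masked *= fbank.sum() / kept
    return masked
```

Because the ratio is re-drawn for every utterance, the masking acts as a stochastic augmentation during training rather than a fixed preprocessing step.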

Figures and Tables from this paper

Macro-Block Dropout for Improved Regularization in Training End-to-End Speech Recognition Models

This work defines a macro-block that contains a large number of units from the input to a Recurrent Neural Network (RNN) and applies random dropout to each macro-block, which has the effect of applying a different dropout rate to each layer even when the average dropout rate is kept constant.
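A minimal sketch of the macro-block idea, assuming the blocks partition the feature axis of the RNN input; the function and parameter names are illustrative, not taken from the paper, and `drop_prob` is assumed to be below 1.

```python
import numpy as np

def macro_block_dropout(x, num_blocks=4, drop_prob=0.2, rng=None):
    # x: (time, feature) input to an RNN layer.
    # Split the feature axis into `num_blocks` contiguous macro-blocks and
    # drop each whole block with probability `drop_prob`, rescaling the
    # survivors (inverted dropout) to keep the expected activation constant.
    rng = rng or np.random.default_rng()
    keep = (rng.uniform(size=num_blocks) >= drop_prob).astype(x.dtype)
    # Block boundaries along the feature axis, then per-block sizes.
    bounds = np.linspace(0, x.shape[1], num_blocks + 1).astype(int)
    block_mask = np.repeat(keep, np.diff(bounds))
    return x * block_mask / (1.0 - drop_prob)
```

Dropping whole blocks rather than individual units is what yields the coarse-grained regularization the entry describes.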

Auditory-Based Data Augmentation for end-to-end Automatic Speech Recognition

The results show that the proposed augmentation methods can bring statistically significant improvement on the performance of the state-of-the-art SpecAugment.

Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition

The proposed utterance invariant training combines three different types of conditioning, namely concatenative, multiplicative, and additive, and shows word error rate reductions of up to 7% relative on LibriSpeech and 10-15% on a large-scale Korean end-to-end two-pass hybrid ASR model.

On the limit of English conversational speech recognition

The Conformer shows performance similar to the LSTM; nevertheless, their combination, decoded with an improved LM, reaches a new record on Switchboard-300, and a new state of the art is reported, practically reaching the limit of the benchmark.

Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing

This paper presents a streaming on-device end-to-end speech recognition solution for a privacy-sensitive voice-typing application, which primarily involves typing private user details and passwords, and explores domain biasing using shallow fusion with a weighted finite-state transducer.

Towards Explainable Classifiers Using the Counterfactual Approach - Global Explanations for Discovering Bias in Data

Using the proposed method, a number of possible bias-causing artifacts are successfully identified and confirmed in dermoscopy images and it is confirmed that black frames have a strong influence on Convolutional Neural Network’s prediction: 22% of them changed the prediction from benign to malignant.

IAES International Journal of Artificial Intelligence (IJ-AI)

A modified MCA that has been successfully applied to various optimization problems is presented, based on harmony search (HS), which provides more exploitation and intensification and gives the proposed method an advantage over the original algorithm.

References

Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition

  • Chanwoo Kim, R. Stern
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2016
Experimental results demonstrate that PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for speech in the presence of various types of additive noise and in reverberant environments, with only slightly greater computational cost than conventional MFCC processing.

Robust speech recognition using a Small Power Boosting algorithm

The experimental results indicate that this simple idea of intentionally boosting the power of time-frequency bins with small energy for both the training and testing datasets is very helpful for very difficult noisy environments such as corruption by background music.
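The boosting idea can be illustrated as follows; this is a sketch, not the paper's exact algorithm, and the choice of flooring bins at a fixed number of decibels below the utterance peak (the `boost_db` parameter) is an assumption for the example.

```python
import numpy as np

def small_power_boosting(power, boost_db=-40.0):
    # power: (time, freq) non-negative spectral power for one utterance.
    # Boost small time-frequency bins by flooring them at a level
    # `boost_db` decibels below the utterance's peak power.
    floor = power.max() * 10.0 ** (boost_db / 10.0)
    return np.maximum(power, floor)
```

Applying the same floor to both training and test features keeps small-energy bins consistent across conditions, which is the effect the entry credits for the robustness to background music.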

Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System

An improved vocal tract length perturbation (VTLP) algorithm is presented as a data augmentation technique, applied to an attention-based end-to-end speech recognition system both with shallow fusion of a Transformer LM and without using any Language Models (LMs).

Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with large (> 10K hours) corpus. We attained around

Power-Law Nonlinearity with Maximally Uniform Distribution Criterion for Improved Neural Network Training in Automatic Speech Recognition

A new algorithm for designing the compressive nonlinearity in a data-driven way is developed, which is much more flexible than the previous approaches and may be extended to other domains as well.

Feature extraction for robust speech recognition using a power-law nonlinearity and power-bias subtraction

Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise.

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional

End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System

The authors' end-to-end speech recognition system built using this training infrastructure showed a 2.44% WER on the LibriSpeech test-clean set after applying shallow fusion with a Transformer language model (LM).

Improved training of end-to-end attention models for speech recognition

This work introduces a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.

Attention-Based Models for Speech Recognition

The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of a high phoneme error rate.