Trainable frontend for robust and far-field keyword spotting

@article{Wang2017TrainableFF,
  title={Trainable frontend for robust and far-field keyword spotting},
  author={Yuxuan Wang and Pascal Getreuer and Thad Hughes and Richard F. Lyon and Rif A. Saurous},
  journal={2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2017},
  pages={5670--5674}
}
Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our… 
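As a rough illustration of the difference between static compression and AGC-based dynamic compression, PCEN can be sketched as a per-channel first-order smoother followed by gain normalization and root compression. The version below is a minimal pure-Python sketch, not the authors' implementation: the function name `pcen`, the default parameter values, and the choice to initialise the smoother with the first frame are all assumptions made for illustration.

```python
import math


def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply a PCEN-style transform to filterbank energies.

    `energy` is a list of frames; each frame is a list of non-negative
    per-channel energies. All default parameter values are illustrative
    assumptions, not values taken from the text above.
    """
    if not energy:
        return []
    # Smoother state M, one value per channel. Starting from the first
    # frame is an assumption of this sketch.
    m = list(energy[0])
    out = []
    for frame in energy:
        row = []
        for c, e in enumerate(frame):
            # Temporal integration: M(t, f) = (1 - s) * M(t-1, f) + s * E(t, f)
            m[c] = (1.0 - s) * m[c] + s * e
            # Automatic gain control: divide by the smoothed energy
            normalised = e / (eps + m[c]) ** alpha
            # Dynamic range compression: offset root instead of log
            row.append((normalised + delta) ** r - delta ** r)
        out.append(row)
    return out


# Hypothetical input: three frames, two channels.
quiet = [[1.0, 2.0], [1.5, 0.5], [3.0, 1.0]]
# The same signal, 100 times more energetic (e.g. a closer talker).
loud = [[100.0 * e for e in frame] for frame in quiet]

# With alpha = 1 the gain control cancels the overall level almost exactly.
pcen_quiet = pcen(quiet, alpha=1.0)
pcen_loud = pcen(loud, alpha=1.0)
```

With `alpha=1.0`, `pcen_quiet` and `pcen_loud` agree to within a small tolerance set by `eps`, whereas a static log compression of the same two inputs would differ by a constant offset of log(100) in every bin.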

Supervised Noise Reduction for Multichannel Keyword Spotting
TLDR
This paper introduces the idea of combining microphone-array speech enhancement with machine learning by incorporating a feedback path from the neural network KWS classifier to its signal preprocessing frontend, so that frontend noise reduction can benefit from, and in turn better serve, backend machine intelligence.
Small Footprint Multi-channel ConvMixer for Keyword Spotting with Centroid Based Awareness
TLDR
A centroid based awareness component is proposed to enhance the system by equipping it with additional spatial geometry information in the latent feature projection space to achieve better noise-robust features with more efficient computation.
Parameterized Channel Normalization for Far-Field Deep Speaker Verification
TLDR
This work addresses far-field speaker verification with a deep neural network (DNN) based speaker embedding extractor using two parametric normalization methods: per-channel energy normalization (PCEN) and parameterized cepstral mean normalization (PCMN), which contain differentiable parameters and thus can be conveniently integrated into, and jointly optimized with, the DNN using automatic differentiation methods.
On Front-end Gain Invariant Modeling for Wake Word Spotting
TLDR
A novel approach is proposed that uses a new feature called $\Delta$LFBE to decouple the AFE gain variations from the WW model; the neural network architectures are modified to accommodate the delta computation, with the feature extraction module unchanged.
Integration of Multi-Look Beamformers for Multi-Channel Keyword Spotting
TLDR
This paper proposes integrating multiple beamformed signals and a microphone signal as input to an end-to-end KWS model and leveraging the attention mechanism to dynamically tune the model's attention to the reliable input sources, which significantly improves KWS performance and reduces the computation cost.
Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection
TLDR
It is demonstrated that KWD with TDSE frontend significantly outperforms the baseline KWD system with or without a generic speech enhancement in terms of equal error rate (EER) in the keyword detection evaluation.
End-to-end Models with auditory attention in Multi-channel Keyword Spotting
TLDR
An attention-based end-to-end model for multi-channel keyword spotting is proposed, which is trained to optimize the KWS result directly and outperforms the baseline model with signal pre-processing techniques in both the clean and noisy testing data.
Hotword Cleaner: Dual-microphone Adaptive Noise Cancellation with Deferred Filter Coefficients for Robust Keyword Spotting
TLDR
An STFT-based adaptive noise cancellation method, modified to use deferred filter coefficients, is proposed to extract hotwords from noisy stereo microphone signals and so improve the noise robustness of hotword (wake-word) detection, a special application of keyword spotting.
Per-Channel Energy Normalization: Why and How
TLDR
This letter investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, both from theoretical and practical standpoints and describes the asymptotic regimes in PCEN: temporal integration, gain control, and dynamic range compression.
Parametric Cepstral Mean Normalization for Robust Speech Recognition
TLDR
Experimental results show that, in contrast to traditional CMN, which degrades performance on clean data, PCMN provides 5% relative improvement on clean data, while also providing 11.2% relative improvement on far-field test data.

References

Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks
TLDR
It is found that system performance can be improved significantly, with relative improvements up to 75% in far-field conditions, by employing a combination of multi-style training and a proposed novel formulation of automatic gain control that estimates the levels of both speech and background noise.
MVA Processing of Speech Features
TLDR
It is argued and demonstrated that MVA works better when applied to the zeroth-order cepstral coefficient than to log energy, that MVA works better in the cepstral domain, and that an ARMA filter is better than either a designed finite impulse response filter or a data-driven filter.
Locally-connected and convolutional neural networks for small footprint speaker recognition
TLDR
This work compares the performance of deep Locally-Connected Networks (LCN) and Convolutional Neural Networks (CNN) for text-dependent speaker recognition and shows that both a LCN and CNN can reduce the total model footprint to 30% of the original size compared to a baseline fully-connected DNN.
RASTA processing of speech
TLDR
The theoretical and experimental foundations of the RASTA method are reviewed, the relationship with human auditory perception is discussed, the original method is extended to combinations of additive noise and convolutional noise, and an application is shown to speech enhancement.
Small-footprint keyword spotting using deep neural networks
TLDR
The application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision; the paper proposes a simple approach based on deep neural networks that achieves 45% relative improvement with respect to a competitive Hidden Markov Model-based system.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Optimal estimators for spectral restoration of noisy speech
TLDR
Results for a speaker-dependent connected-digit speech recognition task with a base error rate of 1.6% show that preprocessing the noisy unknown speech with a 10 dB signal-to-noise ratio reduces the error rate from 42% to 10%.
Convolutional neural networks for small-footprint keyword spotting
TLDR
This work explores using Convolutional Neural Networks for a small-footprint keyword spotting task and finds that the CNN architectures offer between a 27-44% relative improvement in false reject rate compared to a DNN, while fitting into the constraints of each application.
Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification.
  • B. Atal
  • Physics
    The Journal of the Acoustical Society of America
  • 1974
TLDR
The cepstrum was found to be the most effective, providing an identification accuracy of 70% for speech 50 msec in duration, which increased to more than 98% for a duration of 0.5 sec.
Cascades of two-pole-two-zero asymmetric resonators are good models of peripheral auditory function.
  • R. Lyon
  • Physics
    The Journal of the Acoustical Society of America
  • 2011
TLDR
A cascade of two-pole-two-zero filter stages is a good model of the auditory periphery; as an auditory filter model, it provides an excellent fit to data on human detection of tones in masking noise, with fewer fitting parameters than previously reported filter models.