Trainable frontend for robust and far-field keyword spotting
@inproceedings{Wang2017TrainableFF,
  title={Trainable frontend for robust and far-field keyword spotting},
  author={Yuxuan Wang and Pascal Getreuer and Thad Hughes and Richard F. Lyon and Rif A. Saurous},
  booktitle={2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2017},
  pages={5670--5674}
}
Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our…
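The abstract does not spell out the computation, so the following is only a minimal sketch of AGC-based dynamic compression, assuming the commonly cited PCEN form: a per-channel first-order smoother M[t] = (1 - s) M[t-1] + s E[t], followed by (E / (eps + M)^alpha + delta)^r - delta^r. The function name `pcen`, the default parameter values, and the choice to initialise the smoother with the first frame are illustrative assumptions, not details taken from this page.

```python
from typing import List, Sequence


def pcen(
    energies: Sequence[Sequence[float]],
    s: float = 0.025,     # smoothing coefficient (assumed illustrative value)
    alpha: float = 0.98,  # gain-normalization strength (assumed)
    delta: float = 2.0,   # offset applied before the root (assumed)
    r: float = 0.5,       # root-compression exponent (assumed)
    eps: float = 1e-6,    # guards against division by zero
) -> List[List[float]]:
    """Apply PCEN to a [frames][channels] grid of non-negative filterbank energies."""
    if not energies:
        return []
    # Assumption: start the smoother from the first frame's energies.
    smoothed = list(energies[0])
    out: List[List[float]] = []
    for frame in energies:
        # Per-channel first-order IIR smoother: M[t] = (1 - s) * M[t-1] + s * E[t]
        smoothed = [(1.0 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # Automatic gain control (divide by the smoothed energy),
        # then offset root compression instead of a static log.
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoothed)
        ])
    return out


if __name__ == "__main__":
    quiet = [[1.0, 2.0], [1.5, 2.5], [0.5, 1.0]]
    loud = [[100.0 * e for e in frame] for frame in quiet]  # same signal, 100x the energy
    print(pcen(quiet))
    print(pcen(loud))
```

In this sketch, setting `alpha=1.0` and `eps=0.0` makes the output independent of a constant input gain, because the gain cancels in E / M. A static log compression would instead shift every value by the log of the gain, which is the loudness sensitivity the abstract refers to.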
76 Citations
Supervised Noise Reduction for Multichannel Keyword Spotting
- Computer Science
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper introduces the idea of combining microphone-array speech enhancement with machine learning, by incorporating a feedback path from the neural network KWS classifier to its signal preprocessing frontend so that frontend noise reduction can benefit from, and in turn, better serve backend machine intelligence.
Small Footprint Multi-channel ConvMixer for Keyword Spotting with Centroid Based Awareness
- Computer Science
- ArXiv
- 2022
A centroid based awareness component is proposed to enhance the system by equipping it with additional spatial geometry information in the latent feature projection space to achieve better noise-robust features with more efficient computation.
Parameterized Channel Normalization for Far-Field Deep Speaker Verification
- Computer Science
- 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2021
This work addresses far-field speaker verification with a deep neural network (DNN) based speaker embedding extractor, using two parametric normalization methods: per-channel energy normalization (PCEN) and parameterized cepstral mean normalization (PCMN). Both contain differentiable parameters and can therefore be integrated into, and jointly optimized with, the DNN using automatic differentiation.
On Front-end Gain Invariant Modeling for Wake Word Spotting
- Computer Science
- INTERSPEECH
- 2020
A novel approach is proposed that uses a new feature called $\Delta$LFBE to decouple the AFE gain variations from the WW model; the neural network architectures are modified to accommodate the delta computation, while the feature extraction module is left unchanged.
Integration of Multi-Look Beamformers for Multi-Channel Keyword Spotting
- Computer Science
- ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper proposes integrating multiple beamformed signals and a microphone signal as input to an end-to-end KWS model and leveraging the attention mechanism to dynamically tune the model's attention to the reliable input sources, which significantly improves KWS performance and reduces the computation cost.
Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection
- Economics, Computer Science
- INTERSPEECH
- 2018
It is demonstrated that KWD with TDSE frontend significantly outperforms the baseline KWD system with or without a generic speech enhancement in terms of equal error rate (EER) in the keyword detection evaluation.
End-to-end Models with auditory attention in Multi-channel Keyword Spotting
- Computer Science
- ArXiv
- 2018
An attention-based end-to-end model for multi-channel keyword spotting is proposed, which is trained to optimize the KWS result directly and outperforms the baseline model with signal pre-processing techniques in both the clean and noisy testing data.
Hotword Cleaner: Dual-microphone Adaptive Noise Cancellation with Deferred Filter Coefficients for Robust Keyword Spotting
- Engineering
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
An STFT-based adaptive noise cancellation method, modified to use deferred filter coefficients, is proposed to extract hotwords from noisy stereo microphone signals and thereby improve the noise robustness of hotword (wake-word) detection, a special application of keyword spotting.
Per-Channel Energy Normalization: Why and How
- Physics
- IEEE Signal Processing Letters
- 2019
This letter investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, both from theoretical and practical standpoints and describes the asymptotic regimes in PCEN: temporal integration, gain control, and dynamic range compression.
Parametric Cepstral Mean Normalization for Robust Speech Recognition
- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
Experimental results show that, in contrast to traditional CMN, which degrades performance on clean data, PCMN provides a 5% relative improvement on clean data, while also providing an 11.2% relative improvement on far-field test data.
References
Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks
- Computer Science
- 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
It is found that system performance can be improved significantly, with relative improvements up to 75% in far-field conditions, by employing a combination of multi-style training and a proposed novel formulation of automatic gain control that estimates the levels of both speech and background noise.
MVA Processing of Speech Features
- Computer Science
- IEEE Transactions on Audio, Speech, and Language Processing
- 2007
It is argued and demonstrated that MVA works better when applied to the zeroth-order cepstral coefficient than to log energy, that MVA works better in the cepstral domain, and that an ARMA filter is better than either a designed finite impulse response filter or a data-driven filter.
Locally-connected and convolutional neural networks for small footprint speaker recognition
- Computer Science
- INTERSPEECH
- 2015
This work compares the performance of deep Locally-Connected Networks (LCN) and Convolutional Neural Networks (CNN) for text-dependent speaker recognition and shows that both an LCN and a CNN can reduce the total model footprint to 30% of the original size compared to a baseline fully-connected DNN.
RASTA processing of speech
- Computer Science
- IEEE Trans. Speech Audio Process.
- 1994
The theoretical and experimental foundations of the RASTA method are reviewed, the relationship with human auditory perception is discussed, the original method is extended to combinations of additive noise and convolutional noise, and an application is shown to speech enhancement.
Small-footprint keyword spotting using deep neural networks
- Computer Science, Economics
- 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014
The target application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision; the paper proposes a simple approach based on deep neural networks that achieves a 45% relative improvement with respect to a competitive Hidden Markov Model-based system.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
- Computer Science
- IEEE Signal Processing Magazine
- 2012
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Optimal estimators for spectral restoration of noisy speech
- Physics
- ICASSP
- 1984
Results for a speaker-dependent connected-digit speech recognition task with a base error rate of 1.6% show that preprocessing the noisy unknown speech with a 10 dB signal-to-noise ratio reduces the error rate from 42% to 10%.
Convolutional neural networks for small-footprint keyword spotting
- Computer Science
- INTERSPEECH
- 2015
This work explores using Convolutional Neural Networks for a small-footprint keyword spotting task and finds that the CNN architectures offer between a 27-44% relative improvement in false reject rate compared to a DNN, while fitting into the constraints of each application.
Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification.
- Physics
- The Journal of the Acoustical Society of America
- 1974
The cepstrum was found to be the most effective, providing an identification accuracy of 70% for speech 50 msec in duration, which increased to more than 98% for a duration of 0.5 sec.
Cascades of two-pole-two-zero asymmetric resonators are good models of peripheral auditory function.
- Physics
- The Journal of the Acoustical Society of America
- 2011
A cascade of two-pole-two-zero filter stages is a good model of the auditory periphery: used as an auditory filter model, it provides an excellent fit to data on human detection of tones in masking noise, with fewer fitting parameters than previously reported filter models.