Speech Enhancement for Wake-Up-Word detection in Voice Assistants

David Bonet, Guillermo Cámbara, Fernando López, Pablo Gómez, Carlos Segura, Jordi Luque
Keyword spotting, and in particular Wake-Up-Word (WUW) detection, is an important task for voice assistants. A common issue is that the device gets accidentally activated by background noise such as music, TV, or background speech. In this paper, we propose a Speech Enhancement (SE) model adapted to the task of WUW detection that aims at increasing the recognition rate and reducing false alarms in the presence of these types of noise. The SE…


Efficient Keyword Spotting through long-range interactions with Temporal Lambda Networks
Recent models based on attention mechanisms have shown unprecedented performance in the speech recognition domain, but they are computationally expensive and unnecessarily complex for the keyword spotting task.


A scalable noisy speech dataset and online subjective test framework
A noisy speech dataset (MS-SNSD) that can scale to arbitrary sizes depending on the number of speakers, noise types, and speech-to-noise ratio (SNR) levels desired, together with an open-source methodology to evaluate the results subjectively at scale using crowdsourcing.
Detection of Speech Events and Speaker Characteristics through Photo-Plethysmographic Signal Neural Processing
This work explores several end-to-end convolutional neural network architectures for detecting human characteristics such as gender or speaker identity, and evaluates whether speech/non-speech events can be inferred from the PPG signal, where speech may translate into fluctuations of the pulse signal.
Common Voice: A Massively-Multilingual Speech Corpus
This work presents speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages; for most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.
Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR
It is demonstrated that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task.
The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms
The evaluation of five baseline VAD systems on the QUT-NOISE-TIMIT corpus is conducted to validate the data and show that the variety of noise available allows for better evaluation of VAD systems than existing approaches in the literature.
An Investigation into the Effectiveness of Enhancement in ASR Training and Test for Chime-5 Dinner Party Transcription
This approach stands in contrast to, and delivers larger gains than, the common strategy reported in the literature of augmenting the training database with additional artificially degraded speech, and achieves a new single-system state-of-the-art result on the CHiME-5 data.
A Wavenet for Speech Denoising
The proposed model adaptation retains Wavenet's powerful acoustic modeling capabilities, while significantly reducing its time-complexity by eliminating its autoregressive nature.
A Fully Convolutional Neural Network for Speech Enhancement
The proposed network, Redundant Convolutional Encoder-Decoder (R-CED), demonstrates that a convolutional network can be 12 times smaller than a recurrent network yet achieve better performance, showing its applicability to an embedded system: hearing aids.
SEGAN: Speech Enhancement Generative Adversarial Network
This work proposes the use of generative adversarial networks for speech enhancement; it operates at the waveform level, trains the model end-to-end, and incorporates 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them.
Recurrent Neural Networks for Noise Reduction in Robust ASR
This work introduces a model which uses a deep recurrent autoencoder neural network to denoise input features for robust ASR, demonstrates that the model is competitive with existing feature denoising approaches on the Aurora2 task, and outperforms a tandem approach where deep networks are used to predict phoneme posteriors directly.