Corpus ID: 245650705

Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

@article{Wang2022GeneratingAS,
  title={Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words},
  author={Haoxu Wang and Yan Jia and Zeqing Zhao and Xuyang Wang and Junjie Wang and Ming Li},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.00167}
}
Wake-up word detection models are widely used in real life but suffer severe performance degradation when encountering adversarial samples. In this paper, we discuss the concept of confusing words in adversarial samples. Confusing words, which are commonly encountered in practice, are the various kinds of words that sound similar to the predefined keywords. To enhance the wake word detection system's robustness against confusing words, we propose several methods to generate the adversarial confusing…
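
The abstract is truncated before it describes the generation methods, so the following is only a hedged sketch of the general idea, not the paper's actual pipeline: it derives sound-alike negative phrases by single-phoneme substitution on the keyword's phoneme sequence. The keyword and the confusability table below are hypothetical; a real system would synthesize audio for each variant and add it as a negative training sample.

```python
# Hypothetical sketch (not the paper's method): generate "confusing words"
# by substituting one phoneme of the keyword at a time.

# Phonemes that are easily confused with each other (illustrative grouping).
CONFUSABLE = {
    "m": ["n"], "n": ["m"],
    "i": ["e"], "e": ["i"],
    "s": ["z"], "z": ["s"],
}

def confusing_variants(phonemes):
    """Yield phoneme sequences that differ from the keyword in one position."""
    for i, p in enumerate(phonemes):
        for alt in CONFUSABLE.get(p, []):
            yield phonemes[:i] + [alt] + phonemes[i + 1:]

# Example with a made-up keyword "nimi" -> "mimi", "nemi", "nini", "nime".
for v in confusing_variants(list("nimi")):
    print("".join(v))
```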

References

Showing 1-10 of 28 references

Wake Word Detection with Streaming Transformers

TLDR
This paper explores the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including look-ahead to the next chunk, gradient stopping, different positional embedding methods, and adding same-layer dependency between chunks.
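
As a rough, hypothetical illustration of chunk-wise streaming attention with one chunk of look-ahead (the chunk size, sequence length, and mask convention below are assumptions, not values from the paper):

```python
# Sketch of a chunk-wise attention mask: each frame may attend to all
# earlier chunks, its own chunk, and one look-ahead chunk.
import numpy as np

def chunk_mask(seq_len: int, chunk: int, lookahead_chunks: int = 1) -> np.ndarray:
    """Boolean mask: position i may attend to position j iff j's chunk index
    is no later than i's chunk index plus `lookahead_chunks`."""
    ids = np.arange(seq_len) // chunk          # chunk index of each frame
    return ids[None, :] <= ids[:, None] + lookahead_chunks

print(chunk_mask(8, chunk=2).astype(int))
```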

Training Keyword Spotters with Limited and Synthesized Speech Data

TLDR
This paper uses a pre-trained speech embedding model trained to extract useful features for keyword spotting models, and shows that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples.

Domain Aware Training for Far-Field Small-Footprint Keyword Spotting

TLDR
This paper develops three domain aware training systems, including the domain embedding system, the deep CORAL system, and the multi-task learning system, which incorporate domain knowledge into network training and improve the performance of the keyword classifier on far-field conditions.
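 
Of the three systems mentioned, the deep CORAL one has a compact closed form: it penalizes the distance between the second-order statistics (covariances) of source- and target-domain activations. A minimal NumPy sketch, with illustrative batch shapes:

```python
# Deep CORAL loss: ||C_source - C_target||_F^2 / (4 d^2).
import numpy as np

def coral_loss(src: np.ndarray, tgt: np.ndarray) -> float:
    """src, tgt: (batch, feature_dim) activations from the two domains."""
    d = src.shape[1]
    cs = np.cov(src, rowvar=False)             # source covariance (d x d)
    ct = np.cov(tgt, rowvar=False)             # target covariance (d x d)
    return float(np.sum((cs - ct) ** 2)) / (4 * d * d)

rng = np.random.default_rng(0)
print(coral_loss(rng.normal(size=(32, 16)), rng.normal(size=(32, 16))))
```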

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI

TLDR
A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.

Temporal Convolution for Real-time Keyword Spotting on Mobile Devices

TLDR
A temporal convolution approach for real-time KWS on mobile devices that combines temporal convolutions with a compact ResNet architecture, achieving more than a 385x speedup on a Google Pixel 1 while surpassing the accuracy of the state-of-the-art model.
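
A minimal sketch of the core trick as commonly described for TC-ResNet-style models (shapes and kernel size are illustrative assumptions): treat the frequency bins as input channels and convolve along time only, so each layer is a cheap 1-D convolution.

```python
# Temporal (1-D) convolution over the time axis of a feature map.
import numpy as np

def temporal_conv(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """x: (time, in_ch) features; w: (kernel, in_ch, out_ch) weights.
    Returns (time - kernel + 1, out_ch), a valid 1-D convolution in time."""
    k = w.shape[0]
    t_out = x.shape[0] - k + 1
    # Stack k time-shifted views, then contract kernel and channel dims.
    windows = np.stack([x[i:i + t_out] for i in range(k)])  # (k, t_out, in_ch)
    return np.einsum("kti,kio->to", windows, w)

x = np.random.randn(101, 40)        # ~1 s of 40-dim filterbank frames
w = np.random.randn(9, 40, 32)      # kernel of 9 frames, 40 -> 32 channels
print(temporal_conv(x, w).shape)    # (93, 32)
```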

Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting

TLDR
It is shown that the combination of three techniques (LVCSR initialization, multi-task training, and weighted cross-entropy) gives the best results, with a significantly lower False Alarm Rate than the LVCSR-initialization technique alone, across a wide range of Miss Rates.
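
As a hedged sketch of the weighted cross-entropy idea (the class weights below are illustrative assumptions, not the paper's values): up-weighting the keyword class trades off misses against false alarms during training.

```python
# Weighted cross-entropy over per-frame class logits.
import numpy as np

def weighted_ce(logits: np.ndarray, labels: np.ndarray, weights: np.ndarray) -> float:
    """logits: (n, classes); labels: (n,) int class ids; weights: (classes,)."""
    z = logits - logits.max(axis=1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(labels)), labels]
    return float((weights[labels] * nll).mean())

logits = np.array([[2.0, 0.5], [0.1, 1.2]])
labels = np.array([0, 1])
print(weighted_ce(logits, labels, weights=np.array([1.0, 5.0])))  # keyword class up-weighted
```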

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
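
SpecAugment's two masking transforms are simple enough to sketch directly (the mask widths below are illustrative assumptions; the paper also uses a time-warping transform not shown here):

```python
# Zero out a random band of filterbank channels and a random span of frames.
import numpy as np

def spec_augment(spec: np.ndarray, max_f: int = 8, max_t: int = 20,
                 rng=None) -> np.ndarray:
    """spec: (time, freq) log filterbank features; returns a masked copy."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    t_len, f_len = out.shape
    f = rng.integers(0, max_f + 1)             # frequency mask width
    f0 = rng.integers(0, f_len - f + 1)
    out[:, f0:f0 + f] = 0.0
    t = rng.integers(0, max_t + 1)             # time mask width
    t0 = rng.integers(0, t_len - t + 1)
    out[t0:t0 + t, :] = 0.0
    return out

print(spec_augment(np.random.randn(100, 80)).shape)
```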

Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition

TLDR
This paper explores how the current speech synthesis technology can be leveraged to tailor the ASR system for a target domain by preparing only a relevant text corpus and generates speech features using a sequence-to-sequence speech synthesizer.

Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting

TLDR
This paper proposes to apply singular value decomposition (SVD) to further reduce TDNN complexity, and results show that the full-rank TDNN achieves a 19.7% DET AUC reduction compared to a similar-size deep neural network baseline.
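
The SVD step itself is standard low-rank factorization: replace a trained weight matrix with two thinner factors so one affine layer becomes two smaller ones. A minimal sketch with an assumed rank (the matrix sizes and rank are illustrative, not the paper's configuration):

```python
# Compress W (m x n) into A (m x r) @ B (r x n) via truncated SVD.
import numpy as np

def svd_compress(W: np.ndarray, rank: int):
    """Return (A, B) with A @ B approximating W at the given rank."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]                 # fold singular values into A
    B = Vt[:rank]
    return A, B

W = np.random.randn(512, 1024)
A, B = svd_compress(W, rank=64)
# Parameter count drops from 512*1024 to 64*(512+1024).
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```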

HI-MIA: A Far-Field Text-Dependent Speaker Verification Database and the Baselines

  • Xiaoyi Qin, Hui Bu, Ming Li
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
A far-field text-dependent speaker verification database named HI-MIA is presented and a set of end-to-end neural network based baseline systems that adopt single-channel data for training are proposed.