Efficient keyword spotting using time delay neural networks

@inproceedings{Myer2018EfficientKS,
  title={Efficient keyword spotting using time delay neural networks},
  author={Samuel Myer and Vikrant Singh Tomar},
  booktitle={INTERSPEECH},
  year={2018}
}
This paper describes a novel method of live keyword spotting using a two-stage time delay neural network. The model is trained using transfer learning: initial training with phone targets from a large speech corpus is followed by training with keyword targets from a smaller data set. The accuracy of the system is evaluated on two separate tasks. The first is the freely available Google Speech Commands dataset. The second is an in-house task specifically developed for keyword spotting. The…

A Time Delay Neural Network with Shared Weight Self-Attention for Small-Footprint Keyword Spotting
TLDR
This work proposes a time delay neural network with shared weight self-attention for small-footprint keyword spotting that achieves an error rate comparable to the ResNet model.
State Sequence Pooling Training of Acoustic Models for Keyword Spotting
TLDR
A new training method to improve HMM-based keyword spotting is proposed based on a score computed with the keyword/filler model from the entire input sequence that yields significant and consistent improvement over the baseline in adverse noise conditions.
Query-by-Example On-Device Keyword Spotting
TLDR
A threshold prediction method while using the user-specific keyword hypothesis only is proposed, which generates query-specific negatives by rearranging each query utterance in waveform and decides the threshold based on the enrollment queries and generated negatives.
Deep Spoken Keyword Spotting: An Overview
TLDR
The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
Effective Combination of DenseNet and BiLSTM for Keyword Spotting
TLDR
A new network architecture (DenseNet-BiLSTM) is proposed for KWS, which removes the pool on the time dimension in transition layers to preserve speech time series information and outperforms the state-of-the-art methods in terms of accuracy on Google Speech Commands dataset.
Event-driven Pipeline for Low-latency Low-compute Keyword Spotting and Speaker Verification System
TLDR
Evaluation on a self-recorded event dataset based on TIDIGITS shows accuracies of over 93% and 88% on KWS and SV respectively, with minimum system latency of 5 ms on a limited resource device.
Seeing wake words: Audio-visual Keyword Spotting
TLDR
A novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into sequence matching, and pattern detection, to decide whether and when a word of interest is spoken by a talking face, with or without the audio.
Towards noise robust trigger-word detection with contrastive learning pre-task for fast on-boarding of new trigger-words
TLDR
This work explores the use of contrastive learning as a pre-training task that helps the detection model to generalize to different words and noise conditions and proposes a self-supervised technique using chunked words from long sentence audios.
DONUT: CTC-based Query-by-Example Keyword Spotting
TLDR
DONUT is presented, a CTC-based algorithm for online query-by-example keyword spotting that enables custom wakeword detection, has low computational requirements, and is well-suited for both learning and inference on embedded systems without requiring private user data to be uploaded to the cloud.
Investigation of Acoustic Features for Voice Activation Problem
  • Aliaksei Kolesau, D. Šešok
  • Computer Science
    2020 IEEE Open Conference of Electrical, Electronic and Information Sciences (eStream)
  • 2020
TLDR
The results show that CNNs benefit from using prior knowledge in acoustic feature computation, and that the default values of MFCC parameters might not be the best for the voice activation problem: a frame length of 55 ms showed better results than the default length of 20 ms.

References

SHOWING 1-10 OF 18 REFERENCES
Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting
TLDR
This paper proposes to apply singular value decomposition (SVD) to further reduce TDNN complexity, and results show that the full-rank TDNN achieves a 19.7% DET AUC reduction compared to a similar-size deep neural network baseline.
Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting
TLDR
Systems and methods for creating and using Convolutional Recurrent Neural Networks for small-footprint keyword spotting (KWS) systems are described; a CRNN model embodiment demonstrated high accuracy and robust performance in a wide range of environments.
Small-footprint keyword spotting using deep neural networks
TLDR
This application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision, and proposes a simple approach based on deep neural networks that achieves 45% relative improvement with respect to a competitive Hidden Markov Model-based system.
An End-to-End Architecture for Keyword Spotting and Voice Activity Detection
TLDR
Novel inference algorithms for an end-to-end Recurrent Neural Network trained with the Connectionist Temporal Classification loss function are developed which allow the model to achieve high accuracy on both keyword spotting and voice activity detection without retraining.
Multi-Task Learning and Weighted Cross-Entropy for DNN-Based Keyword Spotting
TLDR
It is shown that the combination of 3 techniques (LVCSR-initialization, multi-task training and weighted cross-entropy) gives the best results, with significantly lower False Alarm Rate than the LVCSR-initialization technique alone, across a wide range of Miss Rates.
Phoneme recognition using time-delay neural networks
The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing
End-to-end ASR-free keyword search from speech
TLDR
This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.
Accurate and compact large vocabulary speech recognition on mobile devices
TLDR
An accurate, small-footprint, large vocabulary speech recognizer for mobile devices and an accurate and compact system that runs well below real-time on a Nexus 4 Android phone is described.
Acoustic Modeling for Google Home
TLDR
The technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016, result in a reduction of WER of 8-28% relative to the current production system.
Convolutional neural networks for small-footprint keyword spotting
TLDR
This work explores using Convolutional Neural Networks for a small-footprint keyword spotting task and finds that the CNN architectures offer between a 27-44% relative improvement in false reject rate compared to a DNN, while fitting into the constraints of each application.