MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition

@inproceedings{Majumdar2020MatchboxNet1T,
  title={MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition},
  author={Somshubra Majumdar and Boris Ginsburg},
  booktitle={INTERSPEECH},
  year={2020}
}
We present MatchboxNet, an end-to-end neural network for speech command recognition. MatchboxNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch-normalization, ReLU, and dropout layers. MatchboxNet reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models. The small footprint of MatchboxNet makes it an attractive candidate for devices with limited computational…
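The abstract describes each block as a 1D time-channel separable convolution followed by batch normalization, ReLU, and dropout, wrapped in a residual connection. A minimal PyTorch sketch of one such block is below; the channel count, kernel size, and dropout rate are illustrative assumptions, and this is a simplified single sub-block rather than the reference NeMo implementation (whose blocks repeat the sub-block several times):

```python
import torch
import torch.nn as nn

class TCSConvBlock(nn.Module):
    """Sketch of one 1D time-channel separable conv block: a depthwise
    conv over time followed by a pointwise (1x1) conv over channels,
    then batch norm, ReLU, and dropout, with a residual connection."""

    def __init__(self, channels=64, kernel_size=13, dropout=0.1):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=channels)        # time-wise conv
        self.pointwise = nn.Conv1d(channels, channels, 1)  # channel mixing
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                  # x: (batch, channels, time)
        y = self.pointwise(self.depthwise(x))
        y = self.drop(self.act(self.bn(y)))
        return y + x                       # residual connection

x = torch.randn(8, 64, 128)               # e.g. 64 MFCC channels, 128 frames
print(TCSConvBlock()(x).shape)             # torch.Size([8, 64, 128])
```

The separable factorization is what keeps the parameter count small: a depthwise-plus-pointwise pair costs roughly C·K + C² weights instead of C²·K for a full 1D convolution.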

Citations

MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
TLDR
MarbleNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch-normalization, ReLU, and dropout layers that achieves performance similar to a state-of-the-art VAD model at roughly one-tenth the parameter cost.
Wav2KWS: Transfer Learning From Speech Representations for Keyword Spotting
TLDR
A deep neural network is proposed that can rapidly establish a high-performance KWS system from arbitrary keyword instruction sets, using an encoder pretrained on a large-scale speech corpus as the backbone network together with an effective transfer network for KWS.
Convmixer: Feature Interactive Convolution with Curriculum Learning for Small Footprint and Noisy Far-Field Keyword Spotting
TLDR
A novel feature-interactive convolutional model with merely 100K parameters is proposed in place of an attention module; it promotes the flow of information with more efficient computation under noisy far-field conditions.
Neural Architecture Search For Keyword Spotting
TLDR
This paper uses differentiable architecture search techniques to search for operators and their connections in a predefined cell search space, and achieves state-of-the-art accuracy in the 12-class utterance classification setting commonly reported in the literature.
Encoder-Decoder Neural Architecture Optimization for Keyword Spotting
TLDR
This paper utilizes neural architecture search to design convolutional neural network models that can boost the performance of keyword spotting while maintaining an acceptable memory footprint.
Keyword Transformer: A Self-Attention Model for Keyword Spotting
TLDR
The Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data, is introduced.
AST: Audio Spectrogram Transformer
TLDR
The Audio Spectrogram Transformer (AST) is introduced, the first convolution-free, purely attention-based model for audio classification, which achieves new state-of-the-art results on various audio classification benchmarks.
ImportantAug: a data augmentation agent for speech
TLDR
The proposed ImportantAug outperforms conventional noise augmentation and the baseline on two test sets with additional noise added, and also provides a 25.4% error rate reduction compared to a baseline without data augmentation.
Broadcasted Residual Learning for Efficient Keyword Spotting
TLDR
This work presents a broadcasted residual learning method that achieves high accuracy with small model size and computational load, proposes a novel network architecture, the broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describes how to scale the model up according to the target device's resources.
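The core broadcasting idea in that summary can be illustrated with a short PyTorch sketch, assuming a (batch, channel, frequency, time) feature map; this is a loose sketch of the broadcast mechanism only, not the full BC-ResNet block, and the layer sizes are made up:

```python
import torch
import torch.nn as nn

class BroadcastResidual(nn.Module):
    """Sketch of broadcasted residual learning: collapse the frequency
    axis by averaging, run cheap 1D temporal processing on the result,
    then broadcast-add it back to every frequency bin."""

    def __init__(self, channels=16, kernel_size=3):
        super().__init__()
        self.temporal = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)

    def forward(self, x):                 # x: (batch, channels, freq, time)
        avg = x.mean(dim=2)               # average over frequency: (B, C, T)
        y = self.temporal(avg)            # temporal-only computation
        return x + y.unsqueeze(2)         # broadcast back over frequency
```

Because the expensive temporal computation happens on a tensor with no frequency axis, its cost does not scale with the number of frequency bins, which is where the efficiency gain comes from.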
Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection
TLDR
This work proposes an implicit acoustic echo cancellation (iAEC) framework where a neural network is trained to exploit the additional information from a reference microphone channel to learn to ignore the interfering signal and improve detection performance.
...

References

SHOWING 1-10 OF 34 REFERENCES
Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
TLDR
A new end-to-end neural acoustic model for automatic speech recognition is presented that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal while having fewer parameters than all competing models.
Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting
TLDR
Convolutional recurrent neural networks (CRNNs) for small-footprint keyword spotting (KWS) systems are described, and a CRNN model embodiment is demonstrated to have high accuracy and robust performance in a wide range of environments.
A neural attention model for speech command recognition
TLDR
A convolutional recurrent network with attention for speech command recognition is presented that establishes a new state-of-the-art accuracy of 94.1% and makes it possible to inspect which regions of the audio the network took into consideration when outputting a given category.
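As a rough illustration of attention over a recurrent encoder's per-frame outputs (not the paper's exact model; the dimensions and the pooling form are assumptions), a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Sketch of attention pooling: a learned query scores each time
    step of the encoder output, softmax turns the scores into weights,
    and the weighted sum becomes the utterance-level representation.
    The weights themselves show which audio regions the model used."""

    def __init__(self, dim=128):
        super().__init__()
        self.query = nn.Linear(dim, 1)

    def forward(self, h):                       # h: (batch, time, dim)
        w = torch.softmax(self.query(h).squeeze(-1), dim=1)  # (B, T)
        return (w.unsqueeze(-1) * h).sum(dim=1), w           # pooled, weights
```

Returning the weights `w` alongside the pooled vector is what enables the kind of inspection the summary mentions: plotting `w` against the spectrogram shows where the model attended.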
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients), and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
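The masking idea is simple enough to sketch in a few lines of PyTorch; this omits SpecAugment's time-warping component, and the maximum mask widths here are illustrative, not the paper's policies:

```python
import torch

def spec_augment(features, max_f=8, max_t=20):
    """Minimal SpecAugment-style masking sketch: zero out one random
    frequency band and one random time span of a (freq, time) feature
    matrix, in place. Mask-width caps max_f/max_t are assumptions."""
    f_dim, t_dim = features.shape
    f = torch.randint(0, max_f + 1, (1,)).item()       # band height
    f0 = torch.randint(0, f_dim - f + 1, (1,)).item()  # band start
    features[f0:f0 + f, :] = 0.0                       # frequency mask
    t = torch.randint(0, max_t + 1, (1,)).item()       # span width
    t0 = torch.randint(0, t_dim - t + 1, (1,)).item()  # span start
    features[:, t0:t0 + t] = 0.0                       # time mask
    return features

mel = torch.randn(64, 300)             # 64 mel bins, 300 frames
augmented = spec_augment(mel.clone())  # clone to keep the original
```

Because the masking operates directly on the feature matrix, it needs no changes to the model or the waveform pipeline, which is what makes it cheap to drop into existing training setups.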
EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge
TLDR
This study explores a human-machine collaborative design strategy for building low-footprint DNN architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration.
Streaming End-to-end Speech Recognition for Mobile Devices
TLDR
This work describes efforts at building an E2E speech recognizer using a recurrent neural network transducer, and finds that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.
Training Keyword Spotters with Limited and Synthesized Speech Data
TLDR
This paper uses a speech embedding model pre-trained to extract useful features for keyword spotting models, and shows that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples.
Efficient Keyword Spotting Using Dilated Convolutions and Gating
TLDR
A model is proposed, inspired by the recent success of dilated convolutions in sequence modeling applications and allowing deeper architectures to be trained in resource-constrained configurations; it applies a custom target labeling that back-propagates loss from specific frames of interest, yielding higher accuracy and requiring only detection of the end of the keyword.
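A gated dilated convolution layer in that spirit can be sketched as follows (WaveNet-style tanh/sigmoid gating; the channel count, kernel size, and dilation are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class GatedDilatedConv(nn.Module):
    """Sketch of one gated dilated conv layer: two parallel dilated
    convolutions, one squashed by tanh and one by sigmoid, multiplied
    element-wise so the sigmoid branch gates the tanh branch. Dilation
    widens the receptive field without adding parameters."""

    def __init__(self, channels=32, kernel_size=3, dilation=4):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2   # keep sequence length
        self.filt = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)

    def forward(self, x):                         # x: (batch, channels, time)
        return torch.tanh(self.filt(x)) * torch.sigmoid(self.gate(x))
```

Stacking such layers with exponentially increasing dilation grows the receptive field exponentially in depth, which is what lets a compact model cover an entire keyword's duration.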
Very deep convolutional neural networks for robust speech recognition
Y. Qian, P. Woodland. 2016 IEEE Spoken Language Technology Workshop (SLT), 2016.
TLDR
The extension and optimisation of previous work on very deep convolutional neural networks for effective recognition of noisy speech in the Aurora 4 task are described and it is shown that state-level weighted log likelihood score combination in a joint acoustic model decoding scheme is very effective.
Data-Driven Harmonic Filters for Audio Representation Learning
TLDR
Experimental results show that a simple convolutional neural network back-end with the proposed front-end outperforms state-of-the-art baseline methods in automatic music tagging, keyword spotting, and sound event tagging tasks.
...