MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition

@inproceedings{Majumdar2020MatchboxNet1T,
  title={MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition},
  author={Somshubra Majumdar and Boris Ginsburg},
  booktitle={INTERSPEECH},
  year={2020}
}
We present MatchboxNet, an end-to-end neural network for speech command recognition. MatchboxNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers. MatchboxNet reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models. The small footprint of MatchboxNet makes it an attractive candidate for devices with limited computational resources.
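The abstract fully specifies the building block, so a compact sketch is possible. Below is a minimal PyTorch sketch of one such residual block; it is not the authors' NeMo implementation, and the class names, kernel size, repeat count, and dropout rate are illustrative:

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv(nn.Module):
    """1D time-channel separable convolution: a depthwise convolution over
    time followed by a pointwise (1x1) convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

class MatchboxStyleBlock(nn.Module):
    """Residual block of repeated sub-blocks: separable conv -> batch norm,
    with ReLU and dropout between sub-blocks and after the residual sum."""
    def __init__(self, in_ch, out_ch, kernel_size=13, repeat=2, dropout=0.1):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(repeat):
            layers += [TimeChannelSeparableConv(ch, out_ch, kernel_size),
                       nn.BatchNorm1d(out_ch)]
            if i < repeat - 1:
                layers += [nn.ReLU(), nn.Dropout(dropout)]
            ch = out_ch
        self.body = nn.Sequential(*layers)
        # pointwise residual projection so channel counts match
        self.residual = nn.Sequential(nn.Conv1d(in_ch, out_ch, kernel_size=1),
                                      nn.BatchNorm1d(out_ch))
        self.out = nn.Sequential(nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):
        return self.out(self.body(x) + self.residual(x))

# e.g. a (batch=4, 64 mel channels, 128 frames) input keeps its time length
y = MatchboxStyleBlock(64, 64)(torch.randn(4, 64, 128))
```

The depthwise-plus-pointwise factorization is what keeps the parameter count small: a dense Conv1d with kernel width k costs k·C_in·C_out weights, while the separable version costs roughly k·C_in + C_in·C_out.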

Citations

Wav2KWS: Transfer Learning From Speech Representations for Keyword Spotting
TLDR
A deep neural network is proposed that can rapidly establish a high-performance KWS system from arbitrary keyword instruction sets using an encoder pretrained with a large-scale speech corpus as the backbone network and an effective transfer network for KWS.
Convmixer: Feature Interactive Convolution with Curriculum Learning for Small Footprint and Noisy Far-Field Keyword Spotting
TLDR
A novel feature interactive convolutional model with merely 100K parameters is proposed in place of the attention module; it promotes the flow of information with more efficient computations under noisy far-field conditions.
Encoder-Decoder Neural Architecture Optimization for Keyword Spotting
TLDR
This paper utilizes neural architecture search to design convolutional neural network models that can boost the performance of keyword spotting while maintaining an acceptable memory footprint.
ImportantAug: a data augmentation agent for speech
TLDR
The proposed ImportantAug outperforms the conventional noise augmentation and the baseline on two test sets with additional noise added, and also provides a 25.4% error rate reduction compared to a baseline without data augmentation.
Broadcasted Residual Learning for Efficient Keyword Spotting
TLDR
This work presents a broadcasted residual learning method to achieve high accuracy with small model size and computational load, and proposes a novel network architecture, Broadcasting-residual network (BC-ResNet), based on broadcasting residual learning and describes how to scale up the model according to the target device's resources.
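As a rough illustration of the broadcasting idea (my reading of the summary above, assuming PyTorch; not the authors' code), a block can average the 2D feature map over frequency, run cheap 1D temporal convolutions, and broadcast the result back across frequency for the residual sum:

```python
import torch
import torch.nn as nn

class BroadcastedResidualBlock(nn.Module):
    def __init__(self, channels, temporal_kernel=3):
        super().__init__()
        # 2D frequency-wise depthwise conv kept on the full (freq, time) map
        self.freq_dw = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0),
                      groups=channels),
            nn.BatchNorm2d(channels))
        # 1D temporal path applied after averaging out the frequency axis
        self.temporal = nn.Sequential(
            nn.Conv1d(channels, channels, temporal_kernel,
                      padding=temporal_kernel // 2, groups=channels),
            nn.BatchNorm1d(channels), nn.SiLU(),
            nn.Conv1d(channels, channels, kernel_size=1))
        self.act = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, freq, time)
        y = self.freq_dw(x)
        z = self.temporal(y.mean(dim=2))  # collapse frequency -> (B, C, T)
        # broadcast the 1D result back across frequency, add the residual
        return self.act(x + y + z.unsqueeze(2))
```

Keeping most computation in 1D after the frequency average is what yields the small model size and compute load the summary refers to.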
Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection
TLDR
This work proposes an implicit acoustic echo cancellation (iAEC) framework where a neural network is trained to exploit the additional information from a reference microphone channel to learn to ignore the interfering signal and improve detection performance.
Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices
We demonstrate that 1x1-convolutions in 1D time-channel separable convolutions may be replaced by constant, sparse random ternary matrices with weights in {−1, 0, +1}. Such layers do not perform any…
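A hedged sketch of that replacement, assuming PyTorch (the density, seed, and absence of scaling are illustrative choices, not the paper's settings): the trainable pointwise convolution becomes a constant buffer with entries drawn from {−1, 0, +1}:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomTernaryPointwise(nn.Module):
    """Constant, sparse random ternary 1x1 convolution: fixed at init,
    never updated by the optimizer (stored as a buffer, not a Parameter)."""
    def __init__(self, in_ch, out_ch, density=0.1, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        signs = torch.randint(0, 2, (out_ch, in_ch), generator=g) * 2 - 1
        mask = (torch.rand(out_ch, in_ch, generator=g) < density).float()
        weight = (signs * mask).float().unsqueeze(-1)  # (out, in, 1) kernel
        self.register_buffer("weight", weight)

    def forward(self, x):  # x: (batch, in_ch, time)
        return F.conv1d(x, self.weight)
```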
A Study of Low-Resource Speech Commands Recognition based on Adversarial Reprogramming
TLDR
Experimental results show that with a pretrained AM trained on a large-scale English dataset, the proposed AR-SCR system outperforms the current state-of-the-art results on Lithuanian and Arabic speech commands datasets, with only a limited amount of training data.
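The general adversarial-reprogramming recipe behind such a system can be sketched as follows (a minimal illustration assuming PyTorch; the additive input offset and linear label mapping are generic choices, not necessarily the AR-SCR design):

```python
import torch
import torch.nn as nn

class ReprogrammedClassifier(nn.Module):
    """Repurposes a frozen pretrained acoustic model for a new task by
    learning only (1) an additive input transformation and (2) a mapping
    from source-task labels to target-task labels."""
    def __init__(self, pretrained, input_len, n_source, n_target):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():  # keep the AM frozen
            p.requires_grad = False
        self.delta = nn.Parameter(torch.zeros(input_len))  # learned offset
        self.label_map = nn.Linear(n_source, n_target)

    def forward(self, wav):  # wav: (batch, input_len) -> target logits
        source_logits = self.pretrained(wav + self.delta)
        return self.label_map(source_logits)
```

Only `delta` and `label_map` are trained, which is why the approach suits the low-resource setting the summary describes.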
Attention-Free Keyword Spotting
TLDR
This work explores the usage of gated MLPs—previously shown to be alternatives to transformers in vision tasks—for the keyword spotting task and provides a family of highly efficient MLP-based models for keyword spotting, with less than 0.5 million parameters.
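For context, a gated-MLP block of the kind the summary alludes to looks roughly like this (a sketch assuming PyTorch; the dimensions, expansion factor, and fixed sequence length are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """Splits channels in half and gates one half with a learned linear
    projection over the time axis (the gMLP substitute for attention)."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.proj = nn.Linear(seq_len, seq_len)

    def forward(self, x):  # x: (batch, time, dim)
        u, v = x.chunk(2, dim=-1)
        v = self.proj(self.norm(v).transpose(1, 2)).transpose(1, 2)
        return u * v

class GMLPBlock(nn.Module):
    def __init__(self, dim, seq_len, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, hidden)
        self.sgu = SpatialGatingUnit(hidden, seq_len)
        self.out_proj = nn.Linear(hidden // 2, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        y = F.gelu(self.in_proj(self.norm(x)))
        return x + self.out_proj(self.sgu(y))
```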
An Integrated Framework for Two-pass Personalized Voice Trigger
TLDR
The XMUSPEECH system for Task 1 of the 2020 Personalized Voice Trigger Challenge (PVTC2020) is presented: joint wake-up word detection with speaker verification on close-talking data. A multi-task learning network is proposed, where the phonetic branch is trained with the character label of the utterance and the speaker branch is trained with the label of the speaker.

References

Showing 1-10 of 34 references
Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
TLDR
A new end-to-end neural acoustic model for automatic speech recognition that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models.
Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting
TLDR
A convolutional recurrent neural network (CRNN) for small-footprint keyword spotting (KWS) whose model embodiment demonstrates high accuracy and robust performance in a wide range of environments.
A neural attention model for speech command recognition
TLDR
A convolutional recurrent network with attention for speech command recognition that establishes a new state-of-the-art accuracy of 94.1% and allows inspecting what regions of the audio were taken into consideration by the network when outputting a given category.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
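The masking part of the method is simple enough to sketch directly (assuming PyTorch; the mask widths are illustrative, and the time-warping step described in the paper is omitted):

```python
import torch

def spec_augment(feats, freq_mask=8, time_mask=20, n_masks=2):
    """feats: (batch, n_mels, time). Zeroes random frequency bands and
    time spans on a clone of the input, as in SpecAugment masking."""
    out = feats.clone()
    _, n_mels, n_steps = out.shape
    for _ in range(n_masks):
        f = torch.randint(0, freq_mask + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
        out[:, f0:f0 + f, :] = 0.0   # frequency mask
        t = torch.randint(0, time_mask + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_steps - t), (1,)).item()
        out[:, :, t0:t0 + t] = 0.0   # time mask
    return out
```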
EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge
TLDR
This study explores a human-machine collaborative design strategy for building low-footprint DNN architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration.
Streaming End-to-end Speech Recognition for Mobile Devices
TLDR
This work describes its efforts at building an E2E speech recognizer using a recurrent neural network transducer and finds that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy.
Training Keyword Spotters with Limited and Synthesized Speech Data
TLDR
This paper uses a pre-trained speech embedding model trained to extract useful features for keyword spotting models, and shows that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples.
Efficient Keyword Spotting Using Dilated Convolutions and Gating
TLDR
A model inspired by the recent success of dilated convolutions in sequence modeling, allowing deeper architectures to be trained in resource-constrained configurations, with a custom target labeling that back-propagates loss from specific frames of interest, yielding higher accuracy and requiring only that the end of the keyword be detected.
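A WaveNet-style gated dilated convolution, the pattern this summary points to, can be sketched as follows (assuming PyTorch; not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class GatedDilatedConv(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2  # keep the time length
        self.filt = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=pad)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=pad)

    def forward(self, x):  # x: (batch, channels, time)
        # tanh filter modulated by a sigmoid gate, as in WaveNet
        return torch.tanh(self.filt(x)) * torch.sigmoid(self.gate(x))
```

Stacking such blocks with growing dilation widens the receptive field exponentially while the per-layer cost stays constant.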
Very deep convolutional neural networks for robust speech recognition
Y. Qian and P. Woodland, 2016 IEEE Spoken Language Technology Workshop (SLT), 2016
TLDR
The extension and optimisation of previous work on very deep convolutional neural networks for effective recognition of noisy speech in the Aurora 4 task are described and it is shown that state-level weighted log likelihood score combination in a joint acoustic model decoding scheme is very effective.
Data-Driven Harmonic Filters for Audio Representation Learning
TLDR
Experimental results show that a simple convolutional neural network back-end with the proposed front-end outperforms state-of-the-art baseline methods in automatic music tagging, keyword spotting, and sound event tagging tasks.