MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection

@inproceedings{jia2021marblenet,
  title={MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection},
  author={Fei Jia and Somshubra Majumdar and Boris Ginsburg},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021}
}
We present MarbleNet, an end-to-end neural network for Voice Activity Detection (VAD). MarbleNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers. Compared to a state-of-the-art VAD model, MarbleNet achieves similar performance at roughly one-tenth the parameter cost. We further conduct extensive ablation studies on different training methods and choices of parameters in order to study the…
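The parameter savings come from the time-channel separable convolution: a standard 1D convolution with C_in input channels, C_out output channels, and kernel width K costs C_out * C_in * K weights, whereas the separable version splits it into a per-channel temporal (depthwise) convolution (C_in * K weights) followed by a 1x1 pointwise convolution across channels (C_out * C_in weights). A minimal numpy sketch, with illustrative shapes and names not taken from the paper:

```python
import numpy as np

def time_channel_separable_conv1d(x, depthwise_k, pointwise_w):
    """Hypothetical sketch of one 1D time-channel separable convolution.

    x:           (channels, time) input feature map
    depthwise_k: (channels, kernel) one temporal filter per channel
    pointwise_w: (out_channels, channels) 1x1 mixing across channels
    """
    c, t = x.shape
    k = depthwise_k.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # "same" padding in time
    # Depthwise step: each channel is convolved with its own kernel
    dw = np.empty((c, t))
    for ch in range(c):
        for i in range(t):
            dw[ch, i] = np.dot(xp[ch, i:i + k], depthwise_k[ch])
    # Pointwise step: 1x1 convolution mixes information across channels
    return pointwise_w @ dw
```

For C_in = C_out = 128 and K = 13, the separable form needs 128*13 + 128*128 = 18,048 weights versus 128*128*13 = 212,992 for a standard convolution, which is the roughly-10x reduction the abstract refers to.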


Voice Activity Segment Audio Deduction Method using MobileNet
This paper proposes a method that retains only voice activity segments, excluding non-speech segments, using MobileNet; it achieves a voice-detection accuracy of 93.92% per segment and a reduction accuracy of 88.05% over the entire audio.
Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition
This paper considers a more common and challenging scenario: modeling emotion uncertainty when only single emotion labels are available, and proposes to use deep ensembles to capture uncertainty for multiple emotion descriptors, i.e., action units, discrete expression labels and continuous descriptors.
Towards Better Uncertainty: Iterative Training of Efficient Networks for Multitask Emotion Recognition
A multi-generational self-distillation algorithm is applied to the emotion recognition task to improve uncertainty estimation, yielding more reliable uncertainty estimates than Temperature Scaling and Monte Carlo Dropout.


Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
A new end-to-end neural acoustic model for automatic speech recognition that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models.
Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection
This paper proposes a novel approach to VAD that tackles feature and model selection jointly, and shows that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially in noisy environments.
Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions
CNNs are used as acoustic models for speech activity detection (SAD) on data collected over noisy radio communication channels to illustrate that CNNs have a considerable advantage in fast adaptation for acoustic modeling in these settings.
Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection
This paper proposes an alternative architecture that does not suffer from saturation problems by modeling temporal variations through a stateless dilated convolution neural network (CNN), which differs from conventional CNNs in three respects: it uses dilated causal convolution, gated activations and residual connections.
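The dilated causal convolution mentioned above widens the temporal receptive field without recurrence: each output depends only on the current and past samples, with taps spaced `dilation` steps apart. A minimal numpy sketch under those assumptions (single channel, illustrative names):

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """Illustrative causal 1D convolution with dilation.

    x: (time,) signal; kernel: (k,) taps. output[i] depends only on
    x[i], x[i-d], x[i-2d], ... (no future samples), hence "causal".
    """
    k = len(kernel)
    t = len(x)
    pad = (k - 1) * dilation              # left-pad only => causality
    xp = np.concatenate([np.zeros(pad), x])
    y = np.empty(t)
    for i in range(t):
        # k taps spaced `dilation` apart, ending at the current sample
        window = xp[i : i + pad + 1 : dilation]
        y[i] = np.dot(window, kernel)
    return y
```

Stacking such layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth, which is what lets a stateless CNN model long temporal context without the saturation issues of recurrent gates.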
Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies
A novel, data-driven approach to voice activity detection based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features, clearly outperforming three state-of-the-art reference algorithms under the same conditions.
Jasper: An End-to-End Convolutional Neural Acoustic Model
This paper reports state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data and introduces a new layer-wise optimizer called NovoGrad to improve training.
Personal VAD: Speaker-Conditioned Voice Activity Detection
This system is useful for gating the inputs to a streaming on-device speech recognition system so that it triggers only for the target user, which helps reduce computational cost and battery consumption, especially in scenarios where a keyword detector is not preferred.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies
A new dataset is described which will be released publicly containing densely labeled speech activity in YouTube videos, with the goal of creating a shared, available dataset for speech activity detection.
Letter-Based Speech Recognition with Gated ConvNets
A new speech recognition system, leveraging a simple letter-based ConvNet acoustic model, which shows near state-of-the-art results in word error rate on the LibriSpeech corpus using log-mel filterbanks, both on the "clean" and "other" configurations.