QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Yang Zhang
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also… 
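
As a rough illustration of the 1D time-channel separable convolution named in the abstract, the sketch below assumes the usual decomposition: a depthwise convolution that filters each channel independently along time, followed by a pointwise (kernel size 1) convolution that mixes channels at each time step. The function names, the zero "same" padding, the omission of biases, and the sizes in the usage note are illustrative assumptions, not details taken from the paper.

```python
def depthwise_conv1d(x, kernels):
    """Filter each channel along time with its own kernel.

    x: list of C channels, each a list of T numbers.
    kernels: list of C kernels, each with an odd number K of taps.
    Zero 'same' padding keeps the output length equal to T (an assumption).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        pad = k // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(padded[t + j] * kernel[j] for j in range(k))
            for t in range(len(channel))
        ])
    return out


def pointwise_conv1d(x, weights):
    """Mix channels independently at every time step (kernel size 1).

    weights: C_out rows, each holding C_in coefficients.
    """
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in weights
    ]


def time_channel_separable_conv1d(x, kernels, weights):
    """Depthwise (time) convolution followed by pointwise (channel) mixing."""
    return pointwise_conv1d(depthwise_conv1d(x, kernels), weights)


def param_counts(kernel_size, c_in, c_out):
    """Weight counts, biases ignored: (regular conv, separable conv)."""
    regular = kernel_size * c_in * c_out
    separable = kernel_size * c_in + c_in * c_out
    return regular, separable
```

With hypothetical sizes of 33 taps and 256 input and output channels, `param_counts(33, 256, 256)` gives 2,162,688 weights for a regular 1D convolution against 73,984 for the separable form, which is the kind of saving that keeps the parameter count low.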


MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition
MatchboxNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch-normalization, ReLU and dropout layers that reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models.
ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network
ConVoice can convert speech of any length without compromising quality due to its convolutional architecture, and has comparable quality to similar state-of-the-art models while being extremely fast.
Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion
The proposed Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 dataset, which outperforms the original QuartzNet and is close to the state-of-the-art result.
MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
MarbleNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch-normalization, ReLU and dropout layers that achieves similar performance at roughly one tenth of the parameter cost of a state-of-the-art VAD model.
CarneliNet: Neural Mixture Model for Automatic Speech Recognition
CarneliNet is a CTC-based neural network composed of three mega-blocks, which are made up of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. The paper demonstrates that the number of parallel sub-networks can be reconfigured dynamically to accommodate computational requirements without retraining.
SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification.
SpeakerNet, a new neural architecture for speaker recognition and speaker verification tasks, is composed of residual blocks with 1D depth-wise separable convolutions, batch-normalization, and ReLU layers, and uses an x-vector-based statistics pooling layer to map variable-length utterances to a fixed-length embedding.
Scaling Up Online Speech Recognition Using ConvNets
An online end-to-end speech recognition system based on Time-Depth Separable convolutions and Connectionist Temporal Classification that has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.
TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
This paper proposes a simple method for scaling the widths of ContextNet that achieves a good trade-off between computation and accuracy, and demonstrates that on the widely used LibriSpeech benchmark ContextNet achieves a word error rate of 2.1%/4.6%.
TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis
TalkNet is a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction that eliminates word skipping and repeating and is an attractive candidate for embedded speech synthesis.


Letter-Based Speech Recognition with Gated ConvNets
A new speech recognition system, leveraging a simple letter-based ConvNet acoustic model, which shows near state-of-the-art results in word error rate on the LibriSpeech corpus using log-mel filterbanks, both on the "clean" and "other" configurations.
Fully Convolutional Speech Recognition
This paper presents an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling, trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether.
Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions
We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient
Jasper: An End-to-End Convolutional Neural Acoustic Model
This paper reports state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data and introduces a new layer-wise optimizer called NovoGrad to improve training.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
End-to-End Speech Recognition From the Raw Waveform
End-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions, and the trainable filterbanks show a consistent improvement in word error rate relative to comparable mel-filterbanks.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the
Audio augmentation for speech recognition
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention with Dilated 1D Convolutions
A new neural network model architecture, namely multi-stream self-attention, is proposed to address the issue and thus make the self-attention mechanism more effective for speech recognition, achieving a word error rate of 2.2% on the test-clean set of the LibriSpeech corpus.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
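
The CTC labelling idea in the last entry can be illustrated by its output mapping: a frame-level path is turned into a label sequence by merging consecutive repeated symbols and then dropping a special blank symbol. The sketch below shows only that collapse step, not the loss or its training procedure; the function name and the choice of `_` as the blank are assumptions made for illustration.

```python
def ctc_collapse(path, blank="_"):
    """Map a frame-level path to a label string.

    Consecutive repeats are merged first, then blanks are dropped, so a
    blank between two identical symbols keeps both of them.
    """
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)
```

For example, `ctc_collapse("hh_e_ll_lo")` returns `"hello"`: the blank between the two runs of `l` is what preserves the double letter.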