Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
@article{Kriman2020QuartznetDA, title={Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions}, author={Samuel Kriman and Stanislav Beliaev and Boris Ginsburg and Jocelyn Huang and Oleksii Kuchaiev and Vitaly Lavrukhin and Ryan Leary and Jason Li and Yang Zhang}, journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year={2020}, pages={6124-6128} }
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also…
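The claim about having fewer parameters follows largely from the time-channel separable convolutions. A minimal sketch of the arithmetic, assuming such a layer is factored into a depthwise convolution across time followed by a pointwise (1x1) convolution across channels; the function names and layer sizes below are illustrative and not taken from the paper:

```python
def conv1d_params(kernel_size, in_channels, out_channels):
    """Weight count of a standard 1D convolution (bias omitted)."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_params(kernel_size, in_channels, out_channels):
    """Weight count of a time-channel separable 1D convolution (bias omitted)."""
    # Depthwise part: one length-K filter per input channel, applied across time.
    depthwise = kernel_size * in_channels
    # Pointwise part: a 1x1 convolution that mixes channels at each time step.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical layer sizes chosen only for illustration.
K, C_IN, C_OUT = 33, 256, 256
standard = conv1d_params(K, C_IN, C_OUT)
separable = separable_conv1d_params(K, C_IN, C_OUT)
print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {standard / separable:.1f}x")
```

With wide kernels the depthwise term stays small, so the cost is dominated by the pointwise channel mixing.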
122 Citations
MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition
- Computer Science, INTERSPEECH
- 2020
MatchboxNet is a deep residual network composed from blocks of 1D time-channel separable convolution, batch-normalization, ReLU and dropout layers that reaches state-of-the-art accuracy on the Google Speech Commands dataset while having significantly fewer parameters than similar models.
ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network
- Computer Science, ArXiv
- 2020
ConVoice can convert speech of any length without compromising quality due to its convolutional architecture, and has comparable quality to similar state-of-the-art models while being extremely fast.
Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion
- Computer Science, 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
The proposed Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 data set, outperforming the original QuartzNet and coming close to the state-of-the-art result.
MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
- Computer Science, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
MarbleNet is a deep residual network composed from blocks of 1D time-channel separable convolution, batch-normalization, ReLU and dropout layers that achieves similar performance at roughly one-tenth the parameter cost of a state-of-the-art VAD model.
CarneliNet: Neural Mixture Model for Automatic Speech Recognition
- Computer Science, ArXiv
- 2021
CarneliNet is a CTC-based neural network built from three mega-blocks, each consisting of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. The paper demonstrates that the number of parallel sub-networks can be dynamically reconfigured to meet computational requirements without retraining.
SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification.
- Computer Science
- 2020
SpeakerNet is a new neural architecture for speaker recognition and speaker verification, composed of residual blocks with 1D depth-wise separable convolutions, batch-normalization, and ReLU layers. It uses an x-vector based statistics pooling layer to map variable-length utterances to a fixed-length embedding.
Scaling Up Online Speech Recognition Using ConvNets
- Computer Science, INTERSPEECH
- 2020
An online end-to-end speech recognition system based on Time-Depth Separable convolutions and Connectionist Temporal Classification that has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.
TitaNet: Neural Model for speaker representation with 1D Depth-wise separable convolutions and global context
- Computer Science, ArXiv
- 2021
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers…
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
- Computer Science, INTERSPEECH
- 2020
This paper proposes a simple method for scaling the widths of ContextNet that achieves a good trade-off between computation and accuracy, and demonstrates that ContextNet achieves a word error rate of 2.1%/4.6% on the widely used LibriSpeech benchmark.
TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis
- Computer Science, INTERSPEECH
- 2021
TalkNet is a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction that eliminates word skipping and repeating and is an attractive candidate for embedded speech synthesis.
References
Letter-Based Speech Recognition with Gated ConvNets
- Physics, ArXiv
- 2017
A new speech recognition system, leveraging a simple letter-based ConvNet acoustic model, which shows near state-of-the-art results in word error rate on the LibriSpeech corpus using log-mel filterbanks, both on the "clean" and "other" configurations.
Fully Convolutional Speech Recognition
- Computer Science, ArXiv
- 2018
This paper presents an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling, trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether.
Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions
- Computer Science, INTERSPEECH
- 2019
We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient…
Jasper: An End-to-End Convolutional Neural Acoustic Model
- Computer Science, INTERSPEECH
- 2019
This paper reports state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data and introduces a new layer-wise optimizer called NovoGrad to improve training.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
- Computer Science, INTERSPEECH
- 2019
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
End-to-End Speech Recognition From the Raw Waveform
- Computer Science, INTERSPEECH
- 2018
End-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions, and trainable filterbanks show a consistent improvement in word error rate relative to comparable mel-filterbanks.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
- Computer Science, ICML
- 2014
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
Audio augmentation for speech recognition
- Computer Science, INTERSPEECH
- 2015
This paper investigates audio-level speech augmentation methods which directly process the raw signal, and presents results on 4 different LVCSR tasks with training data ranging from 100 hours to 1000 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios.
State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention with Dilated 1D Convolutions
- Computer Science, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
A new neural network architecture, multi-stream self-attention, is proposed to make the self-attention mechanism more effective for speech recognition; it achieves a word error rate of 2.2% on the test-clean set of the LibriSpeech corpus.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
- Computer Science, ICML
- 2006
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
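Connectionist temporal classification, the loss QuartzNet is trained with, lets a model emit one label per frame without a frame-level alignment. A minimal sketch of the standard CTC collapsing rule that maps a frame-level path to an output sequence (merge consecutive repeats, then drop blanks); the `_` blank symbol and the function name are illustrative choices, not taken from any of the papers listed here:

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol used only for this illustration


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a per-frame label path to its CTC output sequence."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove the blank symbol.
    return "".join(label for label in merged if label != blank)


print(ctc_collapse("hh_e_ll_lo__"))
```

The blank between the two `l` runs is what allows a genuinely doubled letter to survive the merge step.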