• Corpus ID: 9609298

Residual Convolutional CTC Networks for Automatic Speech Recognition

@article{Wang2017ResidualCC,
  title={Residual Convolutional CTC Networks for Automatic Speech Recognition},
  author={Yisen Wang and Xuejiao Deng and Songbai Pu and Zhiheng Huang},
  journal={ArXiv},
  year={2017},
  volume={abs/1702.07793}
}
Deep learning approaches have been widely used in Automatic Speech Recognition (ASR) and have achieved significant accuracy improvements. [...] Key Method: RCNN-CTC is an end-to-end system that exploits the temporal and spectral structures of speech signals simultaneously. Furthermore, a CTC-based system combination is introduced, which differs from the conventional frame-wise senone-based combination.
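As a rough illustration of the CTC criterion that both the RCNN-CTC model and the system combination operate on, the sketch below (plain NumPy; the function name and layout are illustrative, not taken from the paper's code) computes the CTC log-likelihood of a label sequence with the standard forward recursion over the blank-extended label sequence:

```python
import numpy as np

def ctc_log_likelihood(log_probs, labels, blank=0):
    """CTC forward recursion: log P(labels | per-frame log_probs).

    log_probs: (T, V) log-probabilities over the vocabulary per frame.
    labels: target sequence without blanks, e.g. [3, 1, 4].
    Sums over all frame-level alignments that collapse to `labels`.
    """
    T, _ = log_probs.shape
    # Blank-extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)          # log-domain forward variables
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                              # stay
            if s > 0:
                a = np.logaddexp(a, alpha[t - 1, s - 1])     # advance one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = np.logaddexp(a, alpha[t - 1, s - 2])     # skip a blank
            alpha[t, s] = a + log_probs[t, ext[s]]

    # A valid path may end on the final label or on the trailing blank.
    ll = alpha[T - 1, S - 1]
    if S > 1:
        ll = np.logaddexp(ll, alpha[T - 1, S - 2])
    return ll
```

Training minimizes the negative of this quantity, so no frame-level alignment between acoustic frames and labels is ever needed, which is what makes the system end-to-end.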

Figures and Tables from this paper

Deep Group Residual Convolutional CTC Networks for Speech Recognition
TLDR
A novel neural network, denoted GRCNN-CTC, is proposed, which integrates group residual convolutional blocks and recurrent layers paired with Connectionist Temporal Classification (CTC) loss; it greatly reduces computational overhead and converges faster, enabling scaling to deeper architectures.
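For intuition on the grouped convolutions this TLDR refers to, here is a minimal sketch under the usual definition of grouped convolution (plain NumPy, not code from the cited paper): each output channel only sees the input channels of its own group, which divides the weight count by the number of groups.

```python
import numpy as np

def grouped_conv1d(x, w, groups):
    """Valid 1-D grouped convolution (cross-correlation form).

    x: (C_in, T) input feature map
    w: (C_out, C_in // groups, K) filter bank
    With `groups` groups, the filters hold C_out * (C_in/groups) * K
    weights instead of C_out * C_in * K for a full convolution.
    """
    C_in, T = x.shape
    C_out, Cg, K = w.shape
    assert C_in % groups == 0 and C_out % groups == 0 and Cg == C_in // groups
    og = C_out // groups                     # output channels per group
    out = np.zeros((C_out, T - K + 1))
    for g in range(groups):
        xg = x[g * Cg:(g + 1) * Cg]          # this group's input channels
        for oc in range(og):
            f = w[g * og + oc]               # (Cg, K) filter
            for t in range(T - K + 1):
                out[g * og + oc, t] = np.sum(xg[:, t:t + K] * f)
    return out
```

A group residual block would then wrap two such convolutions with a skip connection, roughly y = x + conv(relu(conv(x))), using 'same' padding so the shapes match.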
A Chinese acoustic model based on convolutional neural network
TLDR
A convolutional neural network architecture composed of VGG-style layers and a Connectionist Temporal Classification (CTC) loss function is proposed as a speech recognition acoustic model; experiments demonstrate that the proposed model achieves character error rates (CER) of 17.97% and 23.86% on the public Mandarin speech corpora AISHELL-1 and ST-CMDS-20170001_1, respectively.
Bidirectional Temporal Convolution with Self-Attention Network for CTC-Based Acoustic Modeling
  • Jian Sun, Wu Guo, Bin Gu, Yao Liu
  • Computer Science
    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2019
TLDR
The bidirectional temporal convolution with self-attention network (BTCSAN) is proposed in order to capture both the global and local dependencies of utterances and can obtain a 15.87% relative improvement over the BLSTM-based CTC baseline.
A Study of All-Convolutional Encoders for Connectionist Temporal Classification
TLDR
This work presents an exploration of CNNs as encoders for CTC models, in the context of character-based (lexicon-free) automatic speech recognition, and explores a range of one-dimensional convolutional layers, which are particularly efficient.
Self-attention Networks for Connectionist Temporal Classification in Speech Recognition
TLDR
This work proposes SAN-CTC, a deep, fully self-attentional network for CTC, and shows it is tractable and competitive for end-to-end speech recognition, and explores how label alphabets affect attention heads and performance.
Multi-encoder multi-resolution framework for end-to-end speech recognition
TLDR
A novel Multi-Encoder Multi-Resolution (MEMR) framework based on the joint CTC/Attention model is proposed, which achieves 3.6% WER on the WSJ eval92 test set, the best WER reported for an end-to-end system on this benchmark.
Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
TLDR
Contributions combining CNNs and conventional RNNs with gate, highway, and residual networks to mitigate these problems are presented, and the optimal neural network structures and training strategies for the proposed models are explored.
Improving CTC-based Acoustic Model with Very Deep Residual Time-delay Neural Networks
TLDR
A very deep residual time-delay (VResTD) network is used for CTC-based E2E acoustic modeling (VResTD-CTC), providing frame-wise outputs with local bidirectional information without needing to wait for the whole utterance.
Multi-Stream End-to-End Speech Recognition
TLDR
A multi-stream framework based on joint CTC/Attention E2E ASR is presented, with parallel streams represented by separate encoders aiming to capture diverse information, yielding relative Word Error Rate (WER) reductions.
...

References

Showing 1-10 of 28 references
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
Advances in Very Deep Convolutional Neural Networks for LVCSR
TLDR
This paper proposes a new CNN design without time padding and without time pooling, which is slightly suboptimal for accuracy but has two significant advantages: it enables sequence training and deployment by allowing efficient convolutional evaluation of full utterances, and it allows batch normalization to be straightforwardly adopted for CNNs on sequence data.
Convolutional Neural Networks for Speech Recognition
TLDR
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Very deep convolutional neural networks for robust speech recognition
  • Y. Qian, P. Woodland
  • Computer Science
    2016 IEEE Spoken Language Technology Workshop (SLT)
  • 2016
TLDR
The extension and optimisation of previous work on very deep convolutional neural networks for effective recognition of noisy speech in the Aurora 4 task are described and it is shown that state-level weighted log likelihood score combination in a joint acoustic model decoding scheme is very effective.
Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition
TLDR
The outstanding performance of the novel deep recurrent convolutional neural network combined with deep residual learning indicates that it can potentially be adopted in other sequential problems.
Very deep convolutional networks for end-to-end speech recognition
TLDR
This work successively trains very deep convolutional networks to add more expressive power and better generalization to end-to-end ASR models, applying network-in-network principles, batch normalization, residual connections, and convolutional LSTMs to build very deep recurrent and convolutional structures.
Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition
TLDR
A novel architecture for a deep recurrent neural network, residual LSTM, is introduced, which separates a spatial shortcut path from the temporal one by using output layers, helping to avoid a conflict between spatial- and temporal-domain gradient flows.
Deep Speech: Scaling up end-to-end speech recognition
TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Analysis of CNN-based speech recognition system using raw speech as input
TLDR
This paper analyzes and shows that the CNN-based approach yields ASR trends similar to a standard short-term spectral based ASR system under mismatched (noisy) conditions, with the CNN-based approach being more robust.
Learning the speech front-end with raw waveform CLDNNs
TLDR
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.
...