Modeling long temporal contexts in convolutional neural network-based phone recognition

@inproceedings{Tth2015ModelingLT,
  title={Modeling long temporal contexts in convolutional neural network-based phone recognition},
  author={L{\'a}szl{\'o} T{\'o}th},
  booktitle={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2015},
  pages={4575--4579}
}
  • L. Tóth
  • Published 19 April 2015
  • Computer Science
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
The deep neural network component of current hybrid speech recognizers is trained on a context of consecutive feature vectors. Here, we investigate whether the time span of this input can be extended by splitting it up and modeling it in smaller chunks. One method for this is to train a hierarchy of two networks, while the less well-known split temporal context (STC) method models the left and right contexts of a frame separately. Here, we evaluate these techniques within a convolutional neural… 
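As a rough illustration of the split temporal context (STC) idea described above, the sketch below models the left and right halves of the input window with separate sub-networks whose hidden activations feed a shared output layer. All dimensions and the randomly initialized weights are hypothetical, chosen only to make the structure concrete; this is a minimal NumPy sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 40 spectral features per frame and a
# 17-frame input window (8 left frames + center frame + 8 right frames).
feat_dim, left, right = 40, 8, 8
window = left + 1 + right
hidden, n_phones = 64, 48

# Two separate sub-networks, one per context block -- the STC idea:
# the left and right contexts of a frame are modeled separately.
W_left = rng.standard_normal((hidden, (left + 1) * feat_dim)) * 0.01
W_right = rng.standard_normal((hidden, (right + 1) * feat_dim)) * 0.01
W_out = rng.standard_normal((n_phones, 2 * hidden)) * 0.01

def stc_forward(frames):
    """frames: (window, feat_dim) array centered on the target frame."""
    x_left = frames[: left + 1].ravel()    # left context + center frame
    x_right = frames[left:].ravel()        # center frame + right context
    h_left = np.maximum(0.0, W_left @ x_left)    # ReLU hidden layers
    h_right = np.maximum(0.0, W_right @ x_right)
    logits = W_out @ np.concatenate([h_left, h_right])
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # phone posterior estimates

probs = stc_forward(rng.standard_normal((window, feat_dim)))
print(probs.shape)
```

The alternative mentioned in the abstract, a hierarchy of two networks, would instead feed the posterior outputs of a first network, evaluated at several time offsets, into a second network.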

Multi-resolution spectral input for convolutional neural network-based speech recognition
  • L. Tóth
  • Computer Science
    2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)
  • 2017
TLDR
This work investigates whether it can extend the time span of this input and reduce the number of spectral features at the same time by using a multi-resolution spectrum as input and achieves a relative error rate reduction of 3–4% compared to the conventional high-resolution representation.
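The multi-resolution idea can be sketched as follows, with hypothetical band counts and window sizes (not the paper's exact configuration): keep the central frames at full spectral resolution and average the outer frames into coarser bands, so the temporal context widens while the total feature count shrinks:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: 40 mel bands per frame, 31-frame context window.
bands, ctx = 40, 31
window = rng.standard_normal((ctx, bands))

def reduce_resolution(frame, factor):
    """Average groups of `factor` adjacent bands into one coarse band."""
    return frame.reshape(-1, factor).mean(axis=1)

# Keep full resolution for the 7 central frames; represent the
# remaining 24 outer frames at quarter resolution (10 bands each).
center = ctx // 2
fine = window[center - 3 : center + 4]                       # (7, 40)
coarse = np.array([reduce_resolution(f, 4)
                   for i, f in enumerate(window)
                   if not (center - 3 <= i <= center + 3)])  # (24, 10)

n_features = fine.size + coarse.size
print(n_features, ctx * bands)  # 520 vs 1240 spectral inputs
```

Under these assumed sizes the multi-resolution input needs 520 values instead of the 1240 a full-resolution 31-frame window would require.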
Low latency modeling of temporal contexts for speech recognition
TLDR
A sub-sampled variant of these temporal convolution neural networks, termed time-delay neural networks (TDNNs), is proposed, which reduces the computational complexity by ∼5x during training.
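A minimal sketch of the sub-sampling trick, with hypothetical layer sizes: a TDNN layer splices frames at a set of temporal offsets, and the sub-sampled variant keeps only the edge offsets, so each hidden activation touches fewer inputs while covering the same temporal span:

```python
import numpy as np

rng = np.random.default_rng(1)

T, feat_dim, hidden = 100, 40, 32
frames = rng.standard_normal((T, feat_dim))

def tdnn_layer(x, offsets, W):
    """Splice frames at the given temporal offsets, then apply a ReLU layer."""
    t0, t1 = -min(offsets), x.shape[0] - max(offsets)
    out = []
    for t in range(t0, t1):
        spliced = np.concatenate([x[t + o] for o in offsets])
        out.append(np.maximum(0.0, W @ spliced))
    return np.asarray(out)

# Dense variant splices offsets {-2,...,2}; the sub-sampled variant
# keeps only the edge offsets {-2, 2}, cutting the per-step cost
# while the layer still spans the same 5-frame context.
dense_offsets = [-2, -1, 0, 1, 2]
sub_offsets = [-2, 2]
W_dense = rng.standard_normal((hidden, len(dense_offsets) * feat_dim)) * 0.01
W_sub = rng.standard_normal((hidden, len(sub_offsets) * feat_dim)) * 0.01

h_dense = tdnn_layer(frames, dense_offsets, W_dense)
h_sub = tdnn_layer(frames, sub_offsets, W_sub)
print(h_dense.shape, h_sub.shape)  # same output span, smaller splice
```

The sub-sampled layer multiplies a weight matrix less than half the size at every time step, which is where the claimed computational savings come from when the trick is applied across several stacked layers.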
Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention
TLDR
A deep convolutional neural network with layer-wise context expansion and location-based attention, for large vocabulary speech recognition, and it is shown that this model outperforms both the DNN and LSTM significantly.
Recent progresses in deep learning based acoustic models
TLDR
This paper describes models that are optimized end-to-end, and emphasizes feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence translation model.
Discriminative Autoencoders for Acoustic Modeling
TLDR
This paper describes the implementation of the discriminative autoencoders for training tri-phone acoustic models and presents TIMIT phone recognition results, which demonstrate that the proposed method outperforms the conventional DNN-based approach.
Deep Learning for Distant Speech Recognition
TLDR
Inspired by the idea that cooperation across different DNNs could be the key for counteracting the harmful effects of noise and reverberation, a novel deep learning paradigm called “network of deep neural networks” is proposed.
Speech recognition based strategies for on-line Computer Assisted Language Learning (CALL) systems in Basque
TLDR
In this thesis, utterance verification techniques have been implemented to run alongside the basic tasks of speech recognition software, and it is found that slightly better HMMs are achieved by using data with few errors in the initial stages of training and then training on the whole database.
A Multiversion Programming Inspired Approach to Detecting Audio Adversarial Examples
TLDR
A novel audio AE detection approach MVP-Ears is proposed, which utilizes the diverse off-the-shelf ASRs to determine whether an audio is an AE, and the evaluation shows that the detection accuracy reaches 99.88%.
Towards a system for automatic traffic sound event detection
TLDR
The aim is to devise a robust system for detecting traffic audio events in a real-life environment, using a deep learning model that detects anomalous events and classifies them based on their acoustic waveforms.
Hizketa-ezagutzan oinarritutako estrategiak, euskarazko online OBHI (Ordenagailu Bidezko Hizkuntza Ikaskuntza) sistemetarako [Speech recognition based strategies for online CALL (Computer Assisted Language Learning) systems in Basque]
TLDR
This thesis proposes a novel Word-by-Word Sentence Verification (WWSV) system, where a sentence is verified sequentially, word by word, and as soon as a word is detected it is displayed to the user.

References

Showing 1–10 of 27 references
Modeling long temporal contexts for robust DNN-based speech recognition
TLDR
This work investigates the potential of modeling long temporal acoustic contexts using DNNs and models a 65-frame temporal window of speech signals and yields a 4.4% WER, outperforming the best single DNN by 12.0% relatively.
Deep learning of split temporal context for automatic speech recognition
TLDR
This work proposes an alternative solution that splits the temporal context into blocks, each learned with a separate deep model, and demonstrates that this approach significantly reduces the number of parameters compared to the classical deep learning procedure, and obtains better results on the TIMIT dataset.
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition
  • L. Tóth
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined, and an error rate of 16.7% is reported on the TIMIT phone recognition task, a new record on this dataset.
Speech Recognition Using Long-Span Temporal Patterns in a Deep Network Model
TLDR
It is shown that word recognition accuracy can be significantly enhanced by arranging DNNs in a hierarchical structure to model long-term energy trajectories; the approach is evaluated on the 5000-word Wall Street Journal task.
Exploring convolutional neural network structures and optimization techniques for speech recognition
TLDR
This paper investigates several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the pooling size can be automatically learned.
Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMs in acoustic modeling
TLDR
This paper investigates DNN for several large vocabulary speech recognition tasks and proposes a few ideas to reconfigure the DNN input features, such as using logarithm spectrum features or VTLN normalized features in DNN.
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
On rectified linear units for speech processing
TLDR
This work shows that it can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units.
Convolutional deep maxout networks for phone recognition
TLDR
Phone recognition tests on the TIMIT database show that switching to maxout units from rectifier units decreases the phone error rate for each network configuration studied, and yields relative error rate reductions of between 2% and 6%.
Improvements to Deep Convolutional Neural Networks for LVCSR
TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted and an effective strategy to use dropout during Hessian-free sequence training is introduced.
...