• Corpus ID: 10623622

Convolutional deep maxout networks for phone recognition

@inproceedings{Tth2014ConvolutionalDM,
  title={Convolutional deep maxout networks for phone recognition},
  author={L{\'a}szl{\'o} T{\'o}th},
  booktitle={INTERSPEECH},
  year={2014}
}
  • L. Tóth
  • Published in INTERSPEECH 2014
  • Computer Science
Convolutional neural networks have recently been shown to outperform fully connected deep neural networks on several speech recognition tasks. Their superior performance is due to their convolutional structure that processes several, slightly shifted versions of the input window using the same weights, and then pools the resulting neural activations. This pooling operation makes the network less sensitive to translations. The convolutional network results published up till now used sigmoid or… 
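
A rough way to picture the two mechanisms described above, convolution with shared weights over shifted input windows followed by pooling, plus the maxout units named in the title, which take the maximum over small groups of linear activations, is the minimal NumPy sketch below. The filter length, pooling size, and maxout group size are illustrative assumptions for the example, not the paper's settings.

import numpy as np

def conv_maxpool_1d(frames, filt, pool_size=3):
    """Apply one shared filter at every shift of the input, then max-pool."""
    n, k = len(frames), len(filt)
    acts = np.array([frames[i:i + k] @ filt for i in range(n - k + 1)])
    # Max-pooling keeps only the strongest response in each window of shifts,
    # which is what makes the feature less sensitive to small translations.
    return np.array([acts[i:i + pool_size].max()
                     for i in range(0, len(acts) - pool_size + 1, pool_size)])

def maxout_layer(x, W, b, group_size=2):
    """Maxout units: linear activations, then a max inside each group."""
    z = x @ W + b
    return z.reshape(-1, group_size).max(axis=1)

rng = np.random.default_rng(0)
spectrum = rng.standard_normal(40)                  # e.g. 40 mel channels
pooled = conv_maxpool_1d(spectrum, rng.standard_normal(8))
hidden = maxout_layer(pooled, rng.standard_normal((len(pooled), 20)),
                      np.zeros(20), group_size=2)   # 10 maxout units
print(pooled.shape, hidden.shape)                   # (11,) (10,)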

Deep neural networks with linearly augmented rectifier layers for speech recognition
  • L. Tóth
  • Computer Science
    2018 IEEE 16th World Symposium on Applied Machine Intelligence and Informatics (SAMI)
  • 2018
TLDR
This work combines the two approaches and proposes the very simple technique of composing the network's layers from both rectified and linear neurons, which performs equivalently to or slightly better than a maxout network when trained on a larger data set, while being computationally simpler.
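
The technique named in this TLDR, a hidden layer composed of both rectified and linear neurons, can be sketched in a few lines of NumPy. The 50/50 split and the placement of the linear units are illustrative assumptions here, not necessarily the cited paper's configuration.

import numpy as np

def relu_plus_linear_layer(x, W, b, linear_fraction=0.5):
    """Hidden layer whose units are part rectified (ReLU), part purely linear."""
    z = x @ W + b
    n_linear = int(round(len(z) * linear_fraction))
    out = np.maximum(z, 0.0)        # rectified units
    out[:n_linear] = z[:n_linear]   # leave a subset of the units linear
    return out

rng = np.random.default_rng(0)
h = relu_plus_linear_layer(rng.standard_normal(16),
                           rng.standard_normal((16, 32)), np.zeros(32))
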
Maxout neurons for deep convolutional and LSTM neural networks in speech recognition
Performance Evaluation of Deep Convolutional Maxout Neural Network in Speech Recognition
TLDR
The experimental results show that the combined model (CMDNN) improves speech recognition performance over pre-trained fully connected NNs with sigmoid neurons by about 3%.
Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition
TLDR
The outstanding performance of the novel deep recurrent convolutional neural network applied with deep residual learning indicates that it could potentially be adopted for other sequential problems.
A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition
TLDR
A hybrid CNN-BLSTM architecture is proposed to exploit both the spatial and temporal properties of the speech signal, improving continuous speech recognition and addressing a shortcoming of CNNs, namely that speaker-adapted features cannot be modeled directly in a CNN.
Modeling long temporal contexts in convolutional neural network-based phone recognition
  • L. Tóth
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
The deep neural network component of current hybrid speech recognizers is trained on a context of consecutive feature vectors. Here, we investigate whether the time span of this input can be extended
Multi-resolution spectral input for convolutional neural network-based speech recognition
  • L. Tóth
  • Computer Science
    2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)
  • 2017
TLDR
This work investigates whether the time span of this input can be extended while the number of spectral features is reduced, by using a multi-resolution spectrum as input, and achieves a relative error rate reduction of 3–4% compared to the conventional high-resolution representation.
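
As a purely hypothetical illustration of the idea in this TLDR, the sketch below pairs a fine-resolution short context with a coarser but much wider one obtained by averaging adjacent mel channels, so the longer time span does not multiply the number of spectral features. The context lengths and the downsampling factor are assumptions; the cited paper's exact multi-resolution scheme may differ.

import numpy as np

def multi_resolution_input(spectrogram, center, fine_ctx=7, coarse_ctx=25,
                           coarse_factor=4):
    """Return a fine-resolution short context and a coarse, wide context."""
    fine = spectrogram[center - fine_ctx: center + fine_ctx + 1]
    wide = spectrogram[center - coarse_ctx: center + coarse_ctx + 1]
    # Coarse view: average groups of adjacent mel channels, cutting the
    # number of spectral features while covering a longer time span.
    n_frames, n_mel = wide.shape
    coarse = wide[:, : n_mel - n_mel % coarse_factor]
    coarse = coarse.reshape(n_frames, -1, coarse_factor).mean(axis=2)
    return fine, coarse

spec = np.random.default_rng(1).standard_normal((200, 40))  # frames x mel bins
fine, coarse = multi_resolution_input(spec, center=100)
print(fine.shape, coarse.shape)   # (15, 40) (51, 10)
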
Deep Residual Networks with Auditory Inspired Features for Robust Speech Recognition
TLDR
A Deep Residual Network architecture is proposed, allowing ResNets to be used in speech recognition tasks where the network input is small in comparison with the image dimensions for which they were initially designed, and a modification of the well-known Power Normalized Cepstral Coefficients as input to the ResNet is introduced with the aim of creating a noise invariant representation of the acoustic space.
Deep convolutional neural networks for acoustic modeling in low resource languages
  • William Chan, I. Lane
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
A detailed empirical study of CNNs under the low-resource condition finds that a two-dimensional convolutional structure performs best, and emphasizes the importance of considering both time and spectrum when modelling acoustic patterns.
Recurrent DNNs and its Ensembles on the TIMIT Phone Recognition Task
TLDR
An ensemble of recurrent DNNs performed best, achieving an average phone error rate over 10 experiments that is slightly lower than the best published PER to date, to the best of the authors' knowledge.
...
...

References

SHOWING 1-10 OF 25 REFERENCES
Improving deep neural networks for LVCSR using rectified linear units and dropout
TLDR
Modelling deep neural networks with rectified linear unit (ReLU) non-linearities with minimal human hyper-parameter tuning on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while
Improving deep neural network acoustic models using generalized maxout networks
TLDR
This paper introduces two new types of generalized maxout units, called p-norm and soft-maxout, and presents a method to control the instability that arises when training unbounded-output nonlinearities.
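
For context, the two generalized maxout units named in this reference are usually defined as a p-norm over each group of activations and a smooth log-sum-exp ("soft" max) over the group. The NumPy sketch below shows these commonly cited forms with an assumed group size; it may not match every detail of the cited paper.

import numpy as np

def pnorm_units(z, group_size=10, p=2.0):
    """p-norm generalization of maxout: the p-norm over each group of inputs."""
    g = z.reshape(-1, group_size)
    return (np.abs(g) ** p).sum(axis=1) ** (1.0 / p)

def soft_maxout_units(z, group_size=10):
    """Soft-maxout: a smooth log-sum-exp instead of the hard max over a group."""
    g = z.reshape(-1, group_size)
    return np.logaddexp.reduce(g, axis=1)   # numerically stable log(sum(exp(.)))
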
Stochastic pooling maxout networks for low-resource speech recognition
  • Meng Cai, Yongzhe Shi, Jia Liu
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
A stochastic pooling regularization method for maxout networks is proposed to control overfitting; it is applied within the DNN-HMM framework and its effectiveness is evaluated under a low-resource speech recognition condition.
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition
  • L. Tóth
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined, yielding an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
On rectified linear units for speech processing
TLDR
This work shows that it can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units.
Phone recognition with deep sparse rectifier neural networks
  • L. Tóth
  • Computer Science
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
TLDR
It is shown that a deep architecture of rectifier neurons can attain the same recognition accuracy as deep neural networks, but without the need for pre-training.
Improvements to Deep Convolutional Neural Networks for LVCSR
TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted and an effective strategy to use dropout during Hessian-free sequence training is introduced.
...
...