• Corpus ID: 10623622

Convolutional deep maxout networks for phone recognition

@inproceedings{Tth2014ConvolutionalDM,
  title={Convolutional deep maxout networks for phone recognition},
  author={L{\'a}szl{\'o} T{\'o}th},
  booktitle={INTERSPEECH},
  year={2014}
}
  • L. Tóth
  • Published in INTERSPEECH 2014
  • Computer Science
Convolutional neural networks have recently been shown to outperform fully connected deep neural networks on several speech recognition tasks. Their superior performance is due to their convolutional structure that processes several, slightly shifted versions of the input window using the same weights, and then pools the resulting neural activations. This pooling operation makes the network less sensitive to translations. The convolutional network results published up till now used sigmoid or… 
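The abstract above describes the two mechanisms the paper builds on: the same weights are applied to several shifted versions of the input window, and the resulting activations are pooled; the paper's contribution is to combine this with maxout units. Purely as an illustration of that reading, and not as the paper's actual implementation, the following NumPy sketch applies a shared weight matrix to shifted input windows, takes the maximum over groups of linear units (maxout), and then max-pools over neighbouring positions; all dimensions, the group size and the pooling size are assumed values.

    import numpy as np

    def conv_maxout_pool(frames, W, b, group_size=2, pool_size=3):
        """Illustrative 1-D convolutional maxout layer with max-pooling.

        frames     : (num_positions, input_dim) slightly shifted input windows
        W, b       : shared weights (input_dim, num_units) and bias (num_units,)
        group_size : number of linear units competing in each maxout group
        pool_size  : number of neighbouring positions pooled together
        """
        # The same weights process every shifted version of the input window.
        linear = frames @ W + b                        # (num_positions, num_units)

        # Maxout activation: keep the maximum within each group of linear units.
        n_pos, n_units = linear.shape
        maxout = linear.reshape(n_pos, n_units // group_size, group_size).max(axis=2)

        # Max-pooling over neighbouring positions reduces sensitivity to translations.
        n_pooled = n_pos // pool_size
        return maxout[:n_pooled * pool_size].reshape(n_pooled, pool_size, -1).max(axis=1)

    # Example: 6 shifted windows of 40 filterbank features, 8 linear units.
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((6, 40))
    W, b = rng.standard_normal((40, 8)), np.zeros(8)
    print(conv_maxout_pool(frames, W, b).shape)        # -> (2, 4)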

Citations

Deep neural networks with linearly augmented rectifier layers for speech recognition
  • L. Tóth
  • Computer Science
    2018 IEEE 16th World Symposium on Applied Machine Intelligence and Informatics (SAMI)
  • 2018
TLDR
This work combines the two approaches and proposes the very simple technique of composing the layers of the network from both rectified and linear neurons, which performs equivalently to or slightly better than a maxout network when trained on a larger data set, while being computationally simpler.
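The summary above turns on a simple layer construction: some neurons in a hidden layer keep a linear activation while the rest are rectified. As a minimal sketch of that idea under assumed shapes (the split point and the function name are illustrative, not taken from the cited paper):

    import numpy as np

    def linearly_augmented_rectifier_layer(x, W, b, num_linear):
        """Hidden layer whose first `num_linear` units are linear and the rest ReLU."""
        z = x @ W + b
        out = np.maximum(z, 0.0)                     # rectify every unit...
        out[..., :num_linear] = z[..., :num_linear]  # ...then restore the linear ones
        return out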
Maxout neurons for deep convolutional and LSTM neural networks in speech recognition
Performance Evaluation of Deep Convolutional Maxout Neural Network in Speech Recognition
TLDR
The results obtained from the experiments show that the combined model (CMDNN) improves the performance of ANNs in speech recognition by about 3% versus pre-trained fully connected NNs with sigmoid neurons.
An analysis of convolutional neural networks for speech recognition
  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be learned automatically, and it is established that the CNN structure combined with maxout units is the most effective model under small-size constraints for the purpose of deploying small-footprint models to devices.
Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition
TLDR
The outstanding performance of the novel deep recurrent convolutional neural network, applied together with deep residual learning, indicates that it can potentially be adopted for other sequential problems.
A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition
TLDR
A hybrid CNN-BLSTM architecture is proposed to make appropriate use of the spatial and temporal properties of the speech signal, to improve the continuous speech recognition task, and to overcome another shortcoming of CNNs, namely that speaker-adapted features cannot be modeled directly in a CNN.
Evaluation of maxout activations in deep learning across several big data domains
TLDR
It is found that, on average across all datasets, the Rectified Linear Unit activation function performs better than any maxout activation when the number of convolutional filters is increased, without adversely affecting its advantage over maxout activations with respect to network-training speed.
Modeling long temporal contexts in convolutional neural network-based phone recognition
  • L. Tóth
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
The deep neural network component of current hybrid speech recognizers is trained on a context of consecutive feature vectors. Here, we investigate whether the time span of this input can be extended…
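The excerpt above refers to training the DNN component on a context of consecutive feature vectors. For orientation only, the conventional way to form such an input is to stack each frame with its neighbours; the ±7-frame context width below is an assumed value, not one quoted from the paper.

    import numpy as np

    def stack_context(features, left=7, right=7):
        """Stack each frame with `left` past and `right` future frames.

        features : (num_frames, feat_dim) acoustic feature matrix
        returns  : (num_frames, (left + 1 + right) * feat_dim)
        """
        padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
        return np.hstack([padded[i:i + len(features)]
                          for i in range(left + right + 1)])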
Multi-resolution spectral input for convolutional neural network-based speech recognition
  • L. Tóth
  • Computer Science
    2017 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)
  • 2017
TLDR
This work investigates whether the time span of this input can be extended while the number of spectral features is reduced at the same time, by using a multi-resolution spectrum as input, and achieves a relative error rate reduction of 3–4% compared to the conventional high-resolution representation.
An Analysis of Deep Neural Networks in Broad Phonetic Classes for Noisy Speech Recognition
TLDR
The experiments demonstrate that performance is still tightly related to the particular phonetic class, with stops and affricates being the least resilient, and also that the relative improvements of both DNN variants are distributed unevenly across those classes, with the type of noise having a significant influence on the distribution.
...

References

Showing 1-10 of 25 references
Exploring convolutional neural network structures and optimization techniques for speech recognition
TLDR
This paper investigates several CNN architectures, including full and limited weight sharing, convolution along the frequency and time axes, and the stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the pooling size can be learned automatically.
Convolutional deep rectifier neural nets for phone recognition
TLDR
This work modified the rectifier network so that it has a convolutional structure, and found that with deep rectifier networks one can attain a speech recognition performance similar to that of sigmoid nets, but without the need for the time-consuming pre-training procedure.
Deep maxout neural networks for speech recognition
TLDR
Experimental results demonstrate that maxout networks converge faster, generalize better and are easier to optimize than rectified linear networks and sigmoid networks, and experiments show that maxout networks reduce underfitting and are able to achieve good results without dropout training.
Improving deep neural networks for LVCSR using rectified linear units and dropout
TLDR
Modelling deep neural networks with rectified linear unit (ReLU) non-linearities with minimal human hyper-parameter tuning on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
TLDR
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion
We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while…
Improving deep neural network acoustic models using generalized maxout networks
TLDR
This paper introduces two new types of generalized maxout units, called p-norm and soft-maxout, and presents a method to control the instability that can occur when training such unbounded-output nonlinearities.
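For orientation, the two generalized maxout units named above are usually defined as group-wise reductions over the vector of linear activations in a group; the sketch below follows those common definitions (the p-norm of the group, and a log-sum-exp smoothing of the maximum), which should be read as assumptions rather than details quoted from the cited paper.

    import numpy as np

    def pnorm_unit(z, p=2.0):
        """Generalized maxout: the p-norm of a group of linear activations z."""
        return np.sum(np.abs(z) ** p) ** (1.0 / p)

    def soft_maxout_unit(z):
        """Generalized maxout: a smooth (log-sum-exp) stand-in for max(z)."""
        return np.log(np.sum(np.exp(z)))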
Stochastic pooling maxout networks for low-resource speech recognition
  • Meng Cai, Yongzhe Shi, Jia Liu
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
A stochastic pooling regularization method for maxout networks is proposed to control overfitting; it is applied within the DNN-HMM framework and its effectiveness is evaluated under a low-resource speech recognition condition.
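The stochastic pooling mentioned above is, in its original form, a sampling rule: within a pooling group, one activation is picked at random with probability proportional to its magnitude, which regularizes the network compared to always taking the maximum. The sketch below shows that generic rule only; exactly how the cited paper adapts it to maxout groups is not reproduced here.

    import numpy as np

    def stochastic_pool(group, rng):
        """Sample one activation from a pooling group, with probability
        proportional to its non-negative magnitude."""
        a = np.maximum(group, 0.0)
        if a.sum() == 0.0:
            return 0.0               # degenerate group: nothing to favour
        return rng.choice(group, p=a / a.sum())

    # Usage: stochastic_pool(np.array([0.2, 1.3, 0.5]), np.random.default_rng(0))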
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition
  • L. Tóth
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined, and the combined model is reported to achieve an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
...