Phone recognition with hierarchical convolutional deep maxout networks

@article{Tth2015PhoneRW,
  title={Phone recognition with hierarchical convolutional deep maxout networks},
  author={L{\'a}szl{\'o} T{\'o}th},
  journal={EURASIP Journal on Audio, Speech, and Music Processing},
  year={2015},
  volume={2015},
  pages={1--13}
}
Deep convolutional neural networks (CNNs) have recently been shown to outperform fully connected deep neural networks (DNNs) both on low-resource and on large-scale speech tasks. Experiments indicate that convolutional networks can attain a 10–15% relative improvement in the word error rate of large-vocabulary recognition tasks over fully connected deep networks. Here, we explore some refinements to CNNs that have not been pursued by other authors. First, the CNN papers published up till now…
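For background on the maxout networks the paper builds on: a maxout unit computes k affine projections of its input and returns the element-wise maximum over them. The sketch below is illustrative only, not the paper's implementation; all names and shapes are assumptions chosen for clarity.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout activation: element-wise max over k linear pieces.

    x: input vector, shape (d_in,)
    W: weights, shape (k, d_out, d_in)  -- k affine pieces
    b: biases, shape (k, d_out)
    Returns a (d_out,) vector: the maximum across the k pieces.
    """
    z = W @ x + b          # shape (k, d_out): k affine projections
    return z.max(axis=0)   # element-wise maximum across pieces

# Tiny example: with k = 2 pieces, maxout can represent ReLU exactly
# by making one piece the identity and the other the constant zero.
W = np.array([[[1.0]], [[0.0]]])   # shape (2, 1, 1)
b = np.zeros((2, 1))
print(maxout(np.array([-3.0]), W, b))   # -> [0.]  (max(-3, 0))
print(maxout(np.array([2.0]), W, b))    # -> [2.]  (max(2, 0))
```

Because the nonlinearity is learned as a piecewise-linear function rather than fixed, maxout units pair naturally with dropout training, which is one reason the paper combines them with convolutional layers.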
A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition
A hybrid CNN-BLSTM architecture is proposed to exploit both the spatial and temporal properties of the speech signal, improve continuous speech recognition, and address another shortcoming of CNNs: speaker-adapted features cannot be modeled directly in a CNN.
An analysis of convolutional neural networks for speech recognition
  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be learned automatically, and it is established that the CNN structure combined with maxout units is the most effective model under small-size constraints for deploying small-footprint models to devices.
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
This paper proposes an end-to-end speech framework for sequence labeling by combining hierarchical CNNs with CTC directly, without recurrent connections, and argues that CNNs can model temporal correlations given appropriate context information.
Deep neural networks with linearly augmented rectifier layers for speech recognition
  • L. Tóth
  • Computer Science
  • 2018 IEEE 16th World Symposium on Applied Machine Intelligence and Informatics (SAMI)
  • 2018
This work combines the two approaches and proposes the very simple technique of composing the network layers from both rectified and linear neurons, which performs as well as or slightly better than a maxout network when trained on a larger data set, while being computationally simpler.
A comparative analysis of pooling strategies for convolutional neural network based Hindi ASR
This work explores different pooling methods for different architectures on a Hindi speech dataset and shows that max pooling performs well on clean speech, while stochastic pooling works well in noisy environments.
TIMIT and NTIMIT Phone Recognition Using Convolutional Neural Networks
It is shown that this convolutional neural network approach is particularly well suited to network noise and distortion of speech data, as demonstrated by state-of-the-art benchmark results for NTIMIT.
Evaluation of maxout activations in deep learning across several big data domains
It is found that, on average across all datasets, the Rectified Linear Unit activation function performs better than any maxout activation when the number of convolutional filters is increased, without losing its advantage over maxout activations with respect to network-training speed.
Convolutional Neural Networks for Raw Speech Recognition
A CNN-based acoustic model for the raw speech signal is discussed, which establishes the relation between the raw speech signal and phones in a data-driven manner and performs better than traditional cepstral-feature-based systems.
Recurrent DNNs and its Ensembles on the TIMIT Phone Recognition Task
An ensemble of recurrent DNNs performed best, achieving an average phone error rate over 10 experiments that is slightly lower than the best published PER to date, to the authors' knowledge.
Multi-level region-of-interest CNNs for end to end speech recognition
A new pooling technique, multilevel region-of-interest (RoI) pooling, is proposed, which pools information from multiple ConvNet layers and improves the extracted features using the additional information from the multilevel convolutional layers.

References

Showing 1–10 of 69 references
Exploring convolutional neural network structures and optimization techniques for speech recognition
This paper investigates several CNN architectures, including full and limited weight sharing, convolution along the frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the pooling size can be learned automatically.
Deep Convolutional Neural Networks for Large-scale Speech Tasks
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and investigates how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework.
Deep convolutional neural networks for LVCSR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
Improvements to Deep Convolutional Neural Networks for LVCSR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted, and an effective strategy to use dropout during Hessian-free sequence training is introduced.
An analysis of convolutional neural networks for speech recognition
  • J. Huang, Jinyu Li, Y. Gong
  • Computer Science
  • 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be learned automatically, and it is established that the CNN structure combined with maxout units is the most effective model under small-size constraints for deploying small-footprint models to devices.
Convolutional deep rectifier neural nets for phone recognition
This work modifies the rectifier network so that it has a convolutional structure, and finds that deep rectifier networks can attain speech recognition performance similar to that of sigmoid nets, but without the need for the time-consuming pre-training procedure.
Deep maxout neural networks for speech recognition
Experimental results demonstrate that maxout networks converge faster, generalize better, and are easier to optimize than rectified linear and sigmoid networks, and further experiments show that maxout networks reduce underfitting and can achieve good results without dropout training.
Improving deep neural networks for LVCSR using rectified linear units and dropout
Modeling deep neural networks with rectified linear unit (ReLU) non-linearities, with minimal human hyper-parameter tuning, on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units and a 14.4% relative improvement over a strong GMM/HMM system.
Convolutional maxout neural networks for low-resource speech recognition
Experiments on a 24-hour subset of the Switchboard corpus show that the convolutional structure, the maxout nonlinearity, and dropout training all bring superior performance on this task, and that the combination of the three achieves over 10.0% relative improvement over a convolutional neural network baseline.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
The proposed CNN architecture is applied to speech recognition within a hybrid NN-HMM framework, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.