Segmental Recurrent Neural Networks for End-to-End Speech Recognition

@inproceedings{Lu2016SegmentalRN,
  title={Segmental Recurrent Neural Networks for End-to-End Speech Recognition},
  author={Liang Lu and Lingpeng Kong and Chris Dyer and Noah A. Smith and Steve Renals},
  booktitle={INTERSPEECH},
  year={2016}
}
We study the segmental recurrent neural network for end-to-end acoustic modelling. This model connects the segmental conditional random field (CRF) with a recurrent neural network (RNN) used for feature extraction. Compared to most previous CRF-based acoustic models, it does not rely on an external system to provide features or segmentation boundaries. Instead, this model marginalises out all the possible segmentations, and features are extracted from the RNN trained together with the segmental… 
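The marginalisation over all possible segmentations mentioned in the abstract can be computed with a forward-style dynamic program. Below is a minimal sketch in plain Python of that idea; `seg_score` is a hypothetical stand-in for the RNN-derived log-potential of a labelled segment, not the paper's actual interface:

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(seg_score, num_frames, num_labels, max_seg_len):
    """Marginalise over all segmentations of a frame sequence.

    seg_score(start, end, label) -> log-potential of a segment covering
    frames [start, end) with the given label. In the segmental CRF these
    scores would come from the jointly trained RNN feature extractor.
    """
    # alpha[t] = log of the total score of all segmentations of frames [0, t)
    alpha = [float("-inf")] * (num_frames + 1)
    alpha[0] = 0.0
    for t in range(1, num_frames + 1):
        candidates = []
        for start in range(max(0, t - max_seg_len), t):
            for y in range(num_labels):
                candidates.append(alpha[start] + seg_score(start, t, y))
        alpha[t] = log_sum_exp(candidates)
    return alpha[num_frames]
```

With uniform zero log-potentials this reduces to counting (segmentation, labelling) pairs, which is a convenient sanity check for the recursion.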

Citations

End-to-End Neural Segmental Models for Speech Recognition
TLDR
This work reviews neural segmental models, which can be viewed as consisting of a neural network-based acoustic encoder and a finite-state transducer decoder and explores training approaches, including multistage versus end-to-end training and multitask training that combines segmental and frame-level losses.
End-to-End Acoustic Modeling Using Convolutional Neural Networks
End-to-End Speech Recognition Models
TLDR
This thesis proposes a novel approach to ASR with neural attention models and demonstrates an end-to-end speech recognition model that can directly emit English/Chinese characters or even word pieces given the audio signal.
End-to-End Architectures for Speech Recognition
  • Y. Miao, Florian Metze
  • Computer Science
    New Era for Robust Speech Recognition, Exploiting Deep Learning
  • 2017
TLDR
The EESEN framework, which combines connectionist-temporal-classification-based acoustic models with a weighted finite state transducer decoding setup, achieves state-of-the-art word error rates, while at the same time drastically simplifying the ASR pipeline.
End-to-End Acoustic Modeling Using Convolutional Neural Networks for Automatic Speech Recognition (Idiap Research Report)
TLDR
An end-to-end acoustic modeling approach using convolutional neural networks, where the CNN takes the raw speech signal as input and estimates the class conditional probabilities of the HMM states at the output, which consistently yields a better system with fewer parameters when compared to the conventional approach of cepstral feature extraction followed by ANN training.
Inverted Alignments for End-to-End Automatic Speech Recognition
TLDR
An inverted alignment approach for sequence classification systems like automatic speech recognition (ASR) that naturally incorporates discriminative, artificial-neural-network-based label distributions and allows for a variety of model assumptions, including statistical variants of attention.
Sequence Modeling and Alignment for LVCSR-Systems
TLDR
Two novel approaches to DNN-based ASR are discussed and analyzed, the attention-based encoder–decoder approach, and the (segmental) inverted HMM approach, with specific focus on the sequence alignment behavior of the different approaches.
Convolutional Neural Networks for Raw Speech Recognition
TLDR
A CNN-based acoustic model for the raw speech signal is discussed, which establishes the relation between the raw speech signal and phones in a data-driven manner and performs better than traditional cepstral feature-based systems.
Multitask Learning with CTC and Segmental CRF for Speech Recognition
TLDR
It is found that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models, and it is shown that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
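The multitask objective described in this summary can be sketched as a simple weighted sum of the two losses over a shared encoder; `lam`, `ctc_loss`, and `scrf_loss` are illustrative names, not the paper's actual API:

```python
def multitask_loss(ctc_loss, scrf_loss, lam=0.5):
    """Weighted combination of CTC and segmental-CRF losses.

    Both losses are assumed to be computed from the same shared RNN
    encoder output; lam in [0, 1] weights the SCRF term and
    (1 - lam) weights the CTC term.
    """
    return lam * scrf_loss + (1.0 - lam) * ctc_loss
```

Setting `lam` to 0 or 1 recovers the single-task CTC or SCRF training regimes, respectively.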

References

Showing 1–10 of 35 references
Deep segmental neural networks for speech recognition
TLDR
The deep segmental neural network (DSNN) is proposed, a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments with variable lengths, which allows the DSNN to represent each segment as a single unit, in which frames are made dependent on each other.
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition
TLDR
This paper proposes to use SCRFs with DNNs directly as the acoustic model, a one-pass unified framework that can utilize local phone classifiers, phone transitions, and long-span features in direct word decoding, modelling phones or sub-phonetic segments with variable length.
Multitask Learning with CTC and Segmental CRF for Speech Recognition
TLDR
It is found that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models, and it is shown that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the…
Joint CTC-attention based end-to-end speech recognition using multi-task learning
TLDR
A novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue.
A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition
TLDR
This paper studies the RNN encoder-decoder approach for large vocabulary end-to-end speech recognition, whereby an encoder transforms a sequence of acoustic vectors into a sequence of feature representations, from which a decoder recovers a sequence of words.
Discriminative segmental cascades for feature-rich phone recognition
TLDR
It is shown that beam search is not suitable for learning rescoring models in this approach, though it gives good approximate decoding performance when the model is already well-trained, and an approach inspired by structured prediction cascades, which use max-marginal pruning to generate lattices is considered.
Fast and accurate recurrent neural network acoustic models for speech recognition
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feedforward deep neural networks (DNNs) as acoustic models for speech recognition…