End-to-End Neural Segmental Models for Speech Recognition

  • Hao Tang, Liang Lu, Lingpeng Kong, Kevin Gimpel, Karen Livescu, Chris Dyer, Noah A. Smith, Steve Renals
  • Published 1 August 2017
  • Computer Science
  • IEEE Journal of Selected Topics in Signal Processing
Segmental models are an alternative to frame-based models for sequence prediction, where hypothesized path weights are based on entire segment scores rather than a single frame at a time. Neural segmental models are segmental models that use neural network-based weight functions. Neural segmental models have achieved competitive results for speech recognition, and their end-to-end training has been explored in several studies. In this work, we review neural segmental models, which can be viewed… 
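The segment-level scoring described in the abstract can be illustrated with a toy dynamic program: each hypothesized path is a sequence of (start frame, end frame, label) segments, and the best path maximizes the sum of whole-segment scores rather than per-frame scores. This is a minimal sketch, not the paper's implementation; `seg_score` stands in for a neural weight function, and the function names, toy labels, and toy scores are assumptions for illustration.

```python
import math

def segmental_viterbi(T, labels, seg_score, max_dur):
    """Best-scoring segmentation of T frames.

    best[t] = max over start s and label y of best[s] + seg_score(s, t, y),
    where a segment spans frames [s, t) and has duration at most max_dur.
    """
    best = [-math.inf] * (T + 1)
    back = [None] * (T + 1)
    best[0] = 0.0
    for t in range(1, T + 1):
        for s in range(max(0, t - max_dur), t):
            for y in labels:
                cand = best[s] + seg_score(s, t, y)
                if cand > best[t]:
                    best[t] = cand
                    back[t] = (s, y)
    # Recover the segmentation by following backpointers from frame T.
    segs, t = [], T
    while t > 0:
        s, y = back[t]
        segs.append((s, t, y))
        t = s
    return best[T], segs[::-1]

# Toy check: two "true" segments get score +1, everything else -1.
def toy_score(s, t, y):
    return 1.0 if (s, t, y) in {(0, 2, "a"), (2, 4, "b")} else -1.0

best_score, segmentation = segmental_viterbi(4, ["a", "b"], toy_score, max_dur=3)
# best_score == 2.0; segmentation == [(0, 2, "a"), (2, 4, "b")]
```

In a neural segmental model, `seg_score` would be computed from encoder features over the whole span [s, t), which is what distinguishes these models from frame-based ones.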


Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

This work considers segmental models for whole-word ("acoustic-to-word") speech recognition, with feature vectors defined using vector embeddings of segments, and finds that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs).



Recent Advances in End-to-End Automatic Speech Recognition

  • Jinyu Li
  • Computer Science
    APSIPA Transactions on Signal and Information Processing
  • 2022
This paper will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry’s perspective.

Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

A segment-based unsupervised clustering algorithm is proposed to re-assign class labels to the segments of speech utterances; it improves the TD-SV performance of TCL-BN and ASR-derived BN features with respect to their standalone counterparts.

Frame-Level Speaker Embeddings for Text-Independent Speaker Recognition and Analysis of End-to-End Model

A Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings is proposed and it is found that the networks are better at discriminating broad phonetic classes than individual phonemes.

Letter-Based Speech Recognition with Gated ConvNets

A new speech recognition system, leveraging a simple letter-based ConvNet acoustic model, which shows near state-of-the-art results in word error rate on the LibriSpeech corpus using log-mel filterbanks, both on the "clean" and "other" configurations.

On The Inductive Bias of Words in Acoustics-to-Word Models

This work studies the optimization and generalization of acoustics-to-word models under different amounts of training data, and analyzes the word embedding space learned by the model, finding that the space has a structure dominated by the pronunciation of words.

Neural Representation Learning in Linguistic Structured Prediction

This thesis argues for the importance of modeling discrete structure in language, even when learning continuous representations, and proposes dynamic recurrent acyclic graphical neural networks (DRAGNN), a modular neural architecture that generalizes the encoder/decoder concept to include explicit linguistic structures.

Knowledge Distillation for Sequence Model

Knowledge distillation, or teacher-student training, has been effectively used to improve the performance of a relatively simple deep learning model (the student) using a more complex model (the teacher).

On the Difficulty of Segmenting Words with Attention

In experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task, suggesting that attention-based segmentation is only useful in limited scenarios.



Deep segmental neural networks for speech recognition

The deep segmental neural network (DSNN) is proposed: a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments of variable length, representing each segment as a single unit in which frames are made dependent on one another.

End-to-end attention-based large vocabulary speech recognition

This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition

This paper proposes to use SCRFs with DNNs directly as the acoustic model: a one-pass unified framework that can exploit local phone classifiers, phone transitions, and long-span features in direct word decoding, modeling phones or sub-phonetic segments of variable length.

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.

Discriminative segmental cascades for feature-rich phone recognition

It is shown that beam search is not suitable for learning rescoring models in this approach, though it gives good approximate decoding performance once the model is well trained; instead, an approach inspired by structured prediction cascades, which uses max-marginal pruning to generate lattices, is considered.

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs, or other components of traditional speech recognizers.

A comparison of training approaches for discriminative segmental models

This paper investigates various losses for training segmental models, introduces a new cost function, compares lattice rescoring results across multiple tasks, and studies the impact of several choices required when optimizing these losses.

Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

It is shown that CTC word models work very well as an end-to-end all-neural speech recognition model, without the traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need for decoding.

Multitask Learning with CTC and Segmental CRF for Speech Recognition

It is found that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models, and it is shown that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
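The multitask objective above can be sketched as a simple interpolation of the two losses over a shared encoder. This is an illustrative sketch only; the function name, the interpolation weight `lam`, and the toy loss values are assumptions, not the paper's formulation.

```python
def joint_loss(loss_scrf, loss_ctc, lam):
    """Interpolated multitask objective over a shared encoder:
    lam = 0 recovers pure SCRF training, lam = 1 pure CTC training."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * loss_scrf + lam * loss_ctc

# The endpoints reduce to the single-task losses.
endpoint_scrf = joint_loss(2.0, 4.0, lam=0.0)  # 2.0
endpoint_ctc = joint_loss(2.0, 4.0, lam=1.0)   # 4.0
midpoint = joint_loss(2.0, 4.0, lam=0.5)       # 3.0
```

Both loss terms are computed from the same encoder output, which is also what makes CTC pretraining of that encoder straightforward.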

Simplifying long short-term memory acoustic models for fast training and decoding

To resolve two challenges faced by LSTM acoustic models, high model complexity and poor decoding efficiency, it is proposed to apply frame skipping during training, and frame skipping with posterior copying (FSPC) during decoding.
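The FSPC idea can be sketched as follows: run the acoustic model only on every `rate`-th frame, then replicate each posterior to cover the skipped frames so the decoder still sees the full frame rate. This is a toy sketch of the concept, not the paper's implementation; the function name, `rate` default, and identity "model" are assumptions.

```python
def fspc_posteriors(frames, model, rate=2):
    """Frame skipping with posterior copying: evaluate the model on a
    subsampled frame sequence, then copy each posterior to stand in
    for the skipped frames at decoding time."""
    kept = frames[::rate]              # frame skipping
    posts = [model(f) for f in kept]   # forward passes on kept frames only
    full = []
    for p in posts:                    # posterior copying
        full.extend([p] * rate)
    return full[:len(frames)]          # trim to the original length

# Toy check with an identity "model" over 5 frames and rate 2.
out = fspc_posteriors([10, 11, 12, 13, 14], model=lambda f: f, rate=2)
# out == [10, 10, 12, 12, 14]
```

The saving comes from halving (at rate 2) the number of forward passes, while posterior copying keeps the decoder interface unchanged.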