End-to-End Neural Segmental Models for Speech Recognition

@article{Tang2017EndtoEndNS,
  title={End-to-End Neural Segmental Models for Speech Recognition},
  author={Hao Tang and Liang Lu and Lingpeng Kong and Kevin Gimpel and Karen Livescu and Chris Dyer and Noah A. Smith and Steve Renals},
  journal={IEEE Journal of Selected Topics in Signal Processing},
  year={2017},
  volume={11},
  pages={1254-1264}
}
  • Hao Tang, Liang Lu, Lingpeng Kong, Kevin Gimpel, Karen Livescu, Chris Dyer, Noah A. Smith, Steve Renals
  • Published 1 August 2017
  • Computer Science
  • IEEE Journal of Selected Topics in Signal Processing
Segmental models are an alternative to frame-based models for sequence prediction, where hypothesized path weights are based on entire segment scores rather than a single frame at a time. Neural segmental models are segmental models that use neural network-based weight functions. Neural segmental models have achieved competitive results for speech recognition, and their end-to-end training has been explored in several studies. In this work, we review neural segmental models, which can be viewed… 

Citations

Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

TLDR
This work considers segmental models for whole-word ("acoustic-to-word") speech recognition, with feature vectors defined using vector embeddings of segments, and finds that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs).

Letter-Based Speech Recognition

TLDR
This paper proposes a letter-based speech recognition system leveraging a ConvNet acoustic model whose key ingredients are Gated Linear Units and high dropout.

Recent Advances in End-to-End Automatic Speech Recognition

  • Jinyu Li
  • Computer Science
    APSIPA Transactions on Signal and Information Processing
  • 2022
TLDR
This paper overviews recent advances in E2E models, focusing on technologies that address the challenges of E2E modeling from the industry's perspective.

Learning to Count Words in Fluent Speech Enables Online Speech Recognition

TLDR
This work introduces Taris, a Transformer-based online speech recognition system aided by an auxiliary task of incremental word counting that uses the cumulative word sum to dynamically segment speech and enable its eager decoding into words.

Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

TLDR
A segment-based unsupervised clustering algorithm is proposed to re-assign class labels to segments of speech utterances; it improves the text-dependent speaker verification (TD-SV) performance of time-contrastive learning bottleneck (TCL-BN) and ASR-derived bottleneck features relative to their standalone counterparts.

Frame-Level Speaker Embeddings for Text-Independent Speaker Recognition and Analysis of End-to-End Model

TLDR
A Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings is proposed and it is found that the networks are better at discriminating broad phonetic classes than individual phonemes.

Letter-Based Speech Recognition with Gated ConvNets

TLDR
A new speech recognition system, leveraging a simple letter-based ConvNet acoustic model, which shows near state-of-the-art results in word error rate on the LibriSpeech corpus using log-mel filterbanks, both on the "clean" and "other" configurations.

On The Inductive Bias of Words in Acoustics-to-Word Models

TLDR
This work studies the optimization and generalization of acoustics-to-word models under different amounts of training data, and analyzes the word embedding space learned by the model, finding that the space has a structure dominated by the pronunciation of words.

Neural Representation Learning in Linguistic Structured Prediction

TLDR
This thesis argues for the importance of modeling discrete structure in language, even when learning continuous representations, and proposes dynamic recurrent acyclic graphical neural networks (DRAGNN), a modular neural architecture that generalizes the encoder/decoder concept to include explicit linguistic structures.

Knowledge Distillation for Sequence Model

Knowledge distillation, or teacher-student training, has been effectively used to improve the performance of a relatively simple deep learning model (the student) using a more complex model (the teacher).

References

SHOWING 1-10 OF 62 REFERENCES

Segmental Recurrent Neural Networks for End-to-End Speech Recognition

TLDR
Practical training and decoding issues, as well as a method to speed up training in the context of speech recognition, are discussed; the model is self-contained and can be trained end-to-end.

Deep segmental neural networks for speech recognition

TLDR
The deep segmental neural network (DSNN) is proposed: a segmental model that uses DNNs to estimate the acoustic scores of phonemic or sub-phonemic segments of variable length, allowing each segment to be represented as a single unit in which frames are made dependent on each other.
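To make the segment-as-a-single-unit idea concrete, here is a minimal Python sketch of scoring a variable-length segment by pooling its frames into one vector (an illustration under assumed shapes and a mean-pooling choice, not the DSNN architecture itself):

    import numpy as np

    def score_segment(frames, label_weights, hidden_w, hidden_b):
        """Score a variable-length segment as a single unit.

        frames: (T_seg, D) frame features for one hypothesized segment.
        Pools the per-frame hidden states into a fixed-size segment
        embedding, then scores each label with a linear layer (a toy
        stand-in for a deeper DNN).
        """
        h = np.tanh(frames @ hidden_w + hidden_b)    # per-frame hidden layer
        seg_embedding = h.mean(axis=0)               # pool frames -> one vector
        return label_weights @ seg_embedding         # one score per label

    rng = np.random.default_rng(0)
    D, H, L = 40, 64, 48                             # feature, hidden, label dims
    hw, hb = rng.normal(size=(D, H)), np.zeros(H)
    lw = rng.normal(size=(L, H))
    segment = rng.normal(size=(7, D))                # a 7-frame segment
    print(score_segment(segment, lw, hw, hb).shape)  # (48,): one score per label

Because the pooling step produces one vector regardless of segment length, the same scorer handles segments of any duration, which is what lets a segmental model treat each segment as a single unit.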

End-to-end attention-based large vocabulary speech recognition

TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.

Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition

TLDR
This paper proposes to use segmental conditional random fields (SCRFs) with DNNs directly as the acoustic model: a one-pass unified framework that models phones or sub-phonetic segments of variable length and can utilize local phone classifiers, phone transitions, and long-span features in direct word decoding.

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.

Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognition systems.

A comparison of training approaches for discriminative segmental models

TLDR
This paper investigates various losses for training segmental models, introduces a new cost function, compares lattice rescoring results across multiple tasks, and studies the impact of several choices required when optimizing these losses.

Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

TLDR
It is shown that the CTC word models work very well as an end-to-end all-neural speech recognition model, without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode.

Multitask Learning with CTC and Segmental CRF for Speech Recognition

TLDR
It is found that this multitask objective improves recognition accuracy when decoding with either the SCRF or CTC models, and it is shown that CTC can also be used to pretrain the RNN encoder, which improves the convergence rate when learning the joint model.
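The multitask objective referred to here is, in sketch form, an interpolation of the two losses computed over a shared encoder (the weighting notation below is an assumption for illustration):

    L = \lambda L_{\mathrm{CTC}} + (1 - \lambda) L_{\mathrm{SCRF}}, \quad \lambda \in [0, 1]

where both losses are computed from the outputs of the same RNN encoder, so gradients from both tasks shape the shared representation.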

Simplifying long short-term memory acoustic models for fast training and decoding

TLDR
To resolve two challenges faced by LSTM acoustic models, high model complexity and poor decoding efficiency, it is proposed to apply frame skipping during training, and frame skipping with posterior copying (FSPC) during decoding.
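A rough sketch of frame skipping with posterior copying at decode time (a toy illustration under assumed shapes and a toy softmax model, not the paper's implementation):

    import numpy as np

    def decode_with_fspc(frames, acoustic_model, skip=2):
        """Frame skipping with posterior copying (FSPC), sketched.

        Evaluate the acoustic model on every (skip+1)-th frame only, then
        copy each computed posterior to the skipped frames that follow it,
        so downstream decoding still sees one posterior per frame.
        """
        kept = frames[::skip + 1]                 # evaluate a subset of frames
        posteriors = acoustic_model(kept)         # (num_kept, num_states)
        copied = np.repeat(posteriors, skip + 1, axis=0)
        return copied[:len(frames)]               # trim to original length

    # Toy acoustic model: softmax over a random projection of 40-dim features.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(40, 100))
    def toy_model(x):
        z = x @ W
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    frames = rng.normal(size=(11, 40))
    print(decode_with_fspc(frames, toy_model).shape)  # (11, 100)

With skip=2, the acoustic model runs on roughly a third of the frames while the decoder's interface (one posterior per frame) is unchanged, which is where the decoding speedup comes from.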