Listen, attend and spell: A neural network for large vocabulary conversational speech recognition

@article{Chan2016ListenAA,
  title={Listen, attend and spell: A neural network for large vocabulary conversational speech recognition},
  author={William Chan and Navdeep Jaitly and Quoc V. Le and Oriol Vinyals},
  journal={2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2016},
  pages={4960-4964}
}
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs.
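
As a rough sketch of the listener described above, the following PyTorch code halves the time resolution at each layer by stacking adjacent frames before each bidirectional LSTM; the module names, layer count, and dimensions here are our own illustrative assumptions, not details taken from the paper.

  import torch
  import torch.nn as nn

  class PyramidalBLSTMLayer(nn.Module):
      """One pBLSTM layer: stack consecutive frame pairs, then run a BLSTM.
      Halves the sequence length and doubles the feature dimension."""
      def __init__(self, input_dim, hidden_dim):
          super().__init__()
          self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                               batch_first=True, bidirectional=True)

      def forward(self, x):  # x: (batch, time, input_dim)
          b, t, d = x.shape
          if t % 2 == 1:  # drop a trailing frame so pairs line up
              x = x[:, : t - 1, :]
              t -= 1
          x = x.reshape(b, t // 2, d * 2)  # concatenate adjacent frames
          out, _ = self.blstm(x)           # (batch, time / 2, 2 * hidden_dim)
          return out

  class Listener(nn.Module):
      """Stacked pBLSTM layers: an 8x reduction in time resolution overall."""
      def __init__(self, n_mels=40, hidden_dim=256):
          super().__init__()
          self.layers = nn.ModuleList([
              PyramidalBLSTMLayer(n_mels, hidden_dim),
              PyramidalBLSTMLayer(2 * hidden_dim, hidden_dim),
              PyramidalBLSTMLayer(2 * hidden_dim, hidden_dim),
          ])

      def forward(self, feats):  # feats: (batch, time, n_mels) filter bank spectra
          h = feats
          for layer in self.layers:
              h = layer(h)
          return h  # high-level features for the attention-based speller

Each layer halves the number of time steps, so three stacked layers cut the sequence the speller must attend over by a factor of eight, which is what makes attention over long utterances tractable.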

Citations

Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition
TLDR
A joint word-character A2W model that learns to first spell a word and then recognize it, providing richer output to the user than simple word hypotheses and making it especially useful for words unseen or rarely seen during training.
Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition
TLDR
A new unidirectional neural network architecture of parallel time-delayed LSTM (PTDLSTM) streams is proposed, which limits the processing latency to 250 ms and shows significant improvements compared to prior art on a variety of ASR tasks.
Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition
TLDR
It is shown that CTC word models work very well as an end-to-end all-neural speech recognition model, without the traditional context-dependent sub-word phone units that require a pronunciation lexicon and without any language model, removing the need for decoding.
End-to-End Speech Recognition Models
TLDR
This thesis proposes a novel approach to ASR with neural attention models and demonstrates an end-to-end speech recognition model that can directly emit English/Chinese characters, or even word pieces, given the audio signal.
Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances
TLDR
A separate length prediction model is created to predict the correct number of wordpieces in the output, which allows us to identify and truncate problematic decoding results without increasing word error rates on the LibriSpeech task.
An Analysis of Decoding for Attention-Based End-to-End Mandarin Speech Recognition
TLDR
This paper investigates the decoding process for attention-based Mandarin models that use syllables and characters as acoustic modeling units, discusses how to incorporate word information into decoding, and presents a detailed analysis of the factors that affect decoding performance.
End-To-End Multi-Talker Overlapping Speech Recognition
  • Anshuman Tripathi, Han Lu, H. Sak
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
In this paper we present an end-to-end speech recognition system that can recognize single-channel speech where multiple talkers can speak at the same time (overlapping speech) by using a neural network.
Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End
TLDR
This study shows that when the system is run in non-streaming mode, where the intent representation is extracted from the entire utterance and then used to bias the streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR).
Bengali speech recognition: A double layered LSTM-RNN approach
TLDR
This paper investigates a long short-term memory (LSTM) recurrent neural network approach to recognizing individual Bengali words, dividing each word into a number of frames that each contain 13 mel-frequency cepstral coefficients (MFCCs), which provide a useful set of distinctive features.
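
As background on the feature pipeline this summary mentions, here is a minimal sketch of extracting 13 MFCCs per frame with librosa; the file name, sampling rate, and window/hop sizes are our assumptions, not values from the paper.

  import librosa

  # Load audio at 16 kHz and compute 13 MFCCs per 25 ms frame with a 10 ms hop.
  y, sr = librosa.load("word.wav", sr=16000)  # "word.wav" is a hypothetical file
  mfcc = librosa.feature.mfcc(
      y=y, sr=sr, n_mfcc=13,
      n_fft=400,       # 25 ms analysis window at 16 kHz
      hop_length=160,  # 10 ms hop between frames
  )
  print(mfcc.shape)  # (13, n_frames): one 13-dimensional vector per frame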
End-to-End Online Speech Recognition with Recurrent Neural Networks
TLDR
An efficient GPU-based RNN training framework for the truncated backpropagation-through-time (BPTT) algorithm, which is suitable for online (continuous) training, and an online version of the connectionist temporal classification (CTC) loss computation, in which the original CTC loss is estimated over a partial sliding window.
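
For reference, the standard offline CTC loss that this online variant approximates is available directly in PyTorch; below is a minimal sketch with made-up shapes and random labels.

  import torch
  import torch.nn as nn

  # Toy batch: 4 utterances, 100 encoder frames, 20 output classes (blank = 0).
  ctc = nn.CTCLoss(blank=0)
  logits = torch.randn(100, 4, 20, requires_grad=True)       # (time, batch, classes)
  log_probs = logits.log_softmax(dim=-1)
  targets = torch.randint(1, 20, (4, 15), dtype=torch.long)  # random label sequences
  input_lengths = torch.full((4,), 100, dtype=torch.long)
  target_lengths = torch.full((4,), 15, dtype=torch.long)
  loss = ctc(log_probs, targets, input_lengths, target_lengths)
  loss.backward()  # in a real system, gradients flow into the acoustic model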
...

References

Showing 1-10 of 31 references
Lexicon-Free Conversational Speech Recognition with Neural Networks
TLDR
An approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks.
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
TLDR
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
End-to-end attention-based large vocabulary speech recognition
TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.
Attention-Based Models for Speech Recognition
TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rates.
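
A minimal sketch of the location-awareness idea: convolve the previous step's attention weights and add the result into an additive attention score, so the model knows where it attended last. All names and dimensions below are our own illustrative choices, not the paper's.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class LocationAwareAttention(nn.Module):
      """Additive attention whose score also sees convolutional
      features of the previous alignment."""
      def __init__(self, enc_dim, dec_dim, attn_dim,
                   conv_channels=10, kernel_size=31):
          super().__init__()
          self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
          self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
          self.W_loc = nn.Linear(conv_channels, attn_dim, bias=False)
          self.conv = nn.Conv1d(1, conv_channels, kernel_size,
                                padding=kernel_size // 2)
          self.v = nn.Linear(attn_dim, 1, bias=False)

      def forward(self, enc_out, dec_state, prev_align):
          # enc_out: (batch, time, enc_dim); dec_state: (batch, dec_dim)
          # prev_align: (batch, time), attention weights of the previous step
          loc = self.conv(prev_align.unsqueeze(1)).transpose(1, 2)
          score = self.v(torch.tanh(
              self.W_enc(enc_out)
              + self.W_dec(dec_state).unsqueeze(1)
              + self.W_loc(loc)
          )).squeeze(-1)                   # (batch, time)
          align = F.softmax(score, dim=-1)
          context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1)
          return context, align            # context: (batch, enc_dim)

At each decoder step the returned alignment is fed back as prev_align for the next step, which discourages the decoder from attending to the same frames repeatedly.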
Hybrid speech recognition with Deep Bidirectional LSTM
TLDR
The hybrid approach with DBLSTM appears to be well suited to tasks where acoustic modelling predominates, and the improvement in word error rate over the deep network is modest, despite a large increase in frame-level accuracy.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Learning acoustic frame labeling for speech recognition with recurrent neural networks
  • H. Sak, A. Senior, J. Schalkwyk
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
It is shown that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with cross-entropy (CE) using HMM state alignments, and the effect of sequence-discriminative training on these models is shown.
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Deep Speech: Scaling up end-to-end speech recognition
TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
...