• Corpus ID: 1921173

Attention-Based Models for Speech Recognition

@inproceedings{Chorowski2015AttentionBasedMF,
  title={Attention-Based Models for Speech Recognition},
  author={Jan Chorowski and Dzmitry Bahdanau and Dmitriy Serdyuk and Kyunghyun Cho and Yoshua Bengio},
  booktitle={NIPS},
  year={2015}
}
Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis [1, 2] and image caption generation [3]. We extend the attention mechanism with features needed for speech recognition and show that, while a direct adaptation of the model used for machine translation performs competitively on the TIMIT phoneme recognition task, it can only be applied to utterances roughly as long as the ones it was trained on. Key Method: We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue.
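As a rough illustration of the key method described above, the following is a minimal NumPy sketch of location-aware attention: the previous step's attention weights are convolved with learned filters, and the resulting location features enter an additive (Bahdanau-style) scorer. All function names, shapes, and parameters here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def location_aware_attention(query, keys, prev_alpha, W_q, W_k, W_f, conv_filters, v):
    """One decoding step of additive attention extended with location features (sketch).

    query        -- (d_q,)       previous decoder state s_{i-1}
    keys         -- (T, d_k)     encoder states h_1 .. h_T
    prev_alpha   -- (T,)         attention weights from the previous step
    conv_filters -- (n_filt, r)  1-D filters convolved with prev_alpha
    W_q, W_k, W_f, v             projections of the additive scorer (assumed shapes)
    """
    T, r = keys.shape[0], conv_filters.shape[1]
    padded = np.pad(prev_alpha, (r // 2, r // 2))
    # Location features: each frame j sees a window of the previous alignment.
    f = np.array([[padded[j:j + r] @ conv_filters[k]
                   for k in range(conv_filters.shape[0])] for j in range(T)])  # (T, n_filt)
    # Additive score e_j = v^T tanh(W_q s_{i-1} + W_k h_j + W_f f_j)
    e = np.tanh(query @ W_q + keys @ W_k + f @ W_f) @ v                        # (T,)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                 # new alignment
    context = alpha @ keys               # context / glimpse c_i
    return context, alpha

# Toy usage with random parameters (purely illustrative):
T, d = 50, 32
ctx, alpha = location_aware_attention(
    np.zeros(d), np.random.randn(T, d), np.full(T, 1.0 / T),
    np.random.randn(d, d), np.random.randn(d, d),
    np.random.randn(4, d), np.random.randn(4, 11), np.random.randn(d))
```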

Citations

Unidirectional Memory-Self-Attention Transducer for Online Speech Recognition
TLDR
The experiments demonstrate that the proposed models improve WER over Restricted-Self-Attention models by a relative 13.5% on WSJ and 7.1% on SWBD, without much increase in computation cost.
A Time-Restricted Self-Attention Layer for ASR
TLDR
This paper applies a restricted self-attention mechanism (with multiple heads) to speech recognition, tries introducing attention layers into TDNN architectures, and replaces LSTM layers with attention layers in TDNN+LSTM architectures.
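The restricted (time-windowed) self-attention idea in the entry above can be sketched by masking attention scores outside a fixed window around each frame. The single-head form, names, and window size below are assumptions for illustration only.

```python
import numpy as np

def time_restricted_self_attention(x, W_q, W_k, W_v, window=15):
    """x: (T, d) acoustic frames; W_q, W_k, W_v: (d, d_h) projections; +/- `window` frames visible."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v                     # (T, d_h)
    scores = q @ k.T / np.sqrt(k.shape[1])                  # (T, T) scaled dot products
    t = np.arange(x.shape[0])
    outside = np.abs(t[:, None] - t[None, :]) > window      # True where |i - j| > window
    scores = np.where(outside, -np.inf, scores)             # block attention outside the window
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v                                            # (T, d_h) context vectors
```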
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, M. Bacchiani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
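As a compact illustration of the multi-head attention mentioned in the entry above, the sketch below lets several independent heads compute their own context vectors and concatenates them. This is a generic sketch under assumed shapes, not the paper's exact LAS variant.

```python
import numpy as np

def multi_head_attention(query, keys, W_q, W_k, W_v, W_o):
    """query: (d,); keys: (T, d); W_q/W_k/W_v: (H, d, d_h) per-head projections; W_o: (H * d_h, d)."""
    heads = []
    for h in range(W_q.shape[0]):
        q = query @ W_q[h]                                  # (d_h,)
        k, v = keys @ W_k[h], keys @ W_v[h]                 # (T, d_h)
        e = k @ q / np.sqrt(q.shape[0])                     # (T,) scaled dot-product scores
        a = np.exp(e - e.max())
        a /= a.sum()
        heads.append(a @ v)                                 # per-head context (d_h,)
    # Concatenated per-head contexts are mixed by the output projection W_o.
    return np.concatenate(heads) @ W_o                      # (d,) combined context
```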
End-to-end attention-based large vocabulary speech recognition
TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
Attention-Based End-to-End Speech Recognition on Voice Search
TLDR
This paper uses character embeddings to deal with the large vocabulary of Mandarin speech recognition, compares two attention mechanisms, and uses attention smoothing to cover long context in the attention model.
An Analysis of Local Monotonic Attention Variants
TLDR
A simple technique to implement windowed attention is presented, which can be applied on top of an existing global attention model, and it is shown that the proposed model can be trained from random initialization and achieves results comparable to the global attention baseline.
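The windowed attention described in the entry above can be sketched as wrapping an existing global scorer so that only frames near the previous attention peak are scored. The `score_fn` interface, window size, and peak-following rule are illustrative assumptions.

```python
import numpy as np

def windowed_attention(score_fn, query, keys, prev_alpha, w=8):
    """Apply an existing global attention scorer only inside a local window.

    score_fn(query, keys_slice) -> unnormalised scores for the slice (assumed interface).
    prev_alpha: (T,) previous attention weights; the window follows their peak.
    """
    T = keys.shape[0]
    center = int(np.argmax(prev_alpha))           # follow the previous alignment peak
    lo, hi = max(0, center - w), min(T, center + w + 1)
    e = np.full(T, -np.inf)
    e[lo:hi] = score_fn(query, keys[lo:hi])       # reuse the global scorer on the window
    alpha = np.exp(e - e[lo:hi].max())            # frames outside the window get zero weight
    alpha /= alpha.sum()
    return alpha @ keys, alpha
```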
Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition
TLDR
This work designs an alternative student network that, instead of using a thinner or shallower model, keeps the original architecture of the teacher model but with shorter sequences (fewer encoder and decoder states) and learns to mimic the same alignment between the current short input speech segments and the transcription.
Neural Incremental Speech Recognition Through Attention Transfer
TLDR
This work constructs an incremental ASR (ISR) system for low-latency recognition by exploiting an attention-based non-incremental ASR framework that is treated as a teacher and teaches the ISR through attention transfer.
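The attention transfer used in the two entries above amounts to training the student's attention distribution to mimic the teacher's alignment, e.g. with a KL-divergence term. The loss form and names below are assumptions for illustration, not the papers' exact objective.

```python
import numpy as np

def attention_transfer_loss(teacher_alpha, student_alpha, eps=1e-8):
    """KL(teacher || student) averaged over output steps.

    teacher_alpha, student_alpha: (U, T) attention matrices (one row per output token),
    each row a distribution over encoder frames.
    """
    t = np.clip(teacher_alpha, eps, 1.0)
    s = np.clip(student_alpha, eps, 1.0)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=1)))
```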
Attention-Based End-to-End Speech Recognition in Mandarin
TLDR
This paper explores the use of an attention-based encoder-decoder model for Mandarin speech recognition and achieves a first promising result, reducing the source sequence length by skipping frames and regularizing the weights for better generalization and convergence.
Encoder Transfer for Attention-based Acoustic-to-word Speech Recognition
TLDR
Domain adaptation based on transfer learning with layer freezing is proposed to adapt the latent linguistic capability of the decoder to the target domain; models trained with the proposed method achieved better accuracy than the baseline models.

References

Showing 1-10 of 42 references
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Deep Speech: Scaling up end-to-end speech recognition
TLDR
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition
  • L. Tóth
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined; the combination achieves an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.
End-To-End Memory Networks
TLDR
A neural network with a recurrent attention model over a possibly large external memory that is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results
TLDR
Initial results demonstrate that this new approach achieves phoneme error rates that are comparable to the state-of-the-art HMM-based decoders, on the TIMIT dataset.
Sequence Transduction with Recurrent Neural Networks
TLDR
This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence.
Neural Machine Translation by Jointly Learning to Align and Translate
TLDR
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
The Application of Hidden Markov Models in Speech Recognition
TLDR
The aim of this review is first to present the core architecture of an HMM-based LVCSR system and then to describe the various refinements which are needed to achieve state-of-the-art performance.