Corpus ID: 237503022

Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition

@article{Zhang2021NonautoregressiveTW,
  title={Non-autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition},
  author={Chuan-Fei Zhang and Yan Liu and Tian-Hao Zhang and Song-Lu Chen and Feng Chen and Xu-Cheng Yin},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.06684}
}
  • Chuan-Fei Zhang, Yan Liu, Tian-Hao Zhang, Song-Lu Chen, Feng Chen, Xu-Cheng Yin
  • Published 14 September 2021
  • Computer Science, Engineering
  • ArXiv
Non-autoregressive (NAR) transformer models have been studied intensively in automatic speech recognition (ASR), and many NAR transformer models use the causal mask to limit token dependencies. However, the causal mask is designed for the left-to-right decoding process of the non-parallel autoregressive (AR) transformer, which is inappropriate for the parallel NAR transformer since it ignores the right-to-left contexts. Some models have been proposed to utilize right-to-left…
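
The distinction the abstract draws can be made concrete with the two attention masks involved. Below is a minimal PyTorch sketch (illustrative, not the paper's code): the causal mask used for left-to-right AR decoding hides all right-to-left context, while the full bidirectional mask a parallel NAR decoder can afford exposes both directions.

import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular mask for left-to-right AR decoding:
    # position i may attend only to positions j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Full mask for a parallel NAR decoder: every position may
    # attend to both its left and right contexts.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

# With seq_len = 4, the upper triangle zeroed out by the causal mask
# is exactly the right-to-left context the paper argues a parallel
# NAR decoder should not discard.
print(causal_mask(4).int())
print(bidirectional_mask(4).int())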

References

SHOWING 1-10 OF 37 REFERENCES
Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition
TLDR
Results on Mandarin (Aishell) and Japanese ASR benchmarks show that such a non-autoregressive network can be trained for ASR and that it matches the performance of the state-of-the-art autoregressive transformer with a 7x speedup.
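
The "fill in the missing letters" idea amounts to feeding the decoder a fully masked target sequence and predicting every token in one parallel pass. A schematic sketch, assuming a hypothetical decoder module and <MASK> token id (both illustrative):

import torch

MASK_ID = 0  # hypothetical <MASK> token id

def nar_decode(decoder, encoder_out, tgt_len):
    # One-shot NAR decoding: every target position starts as <MASK>
    # and all tokens are predicted simultaneously from the encoder
    # output, instead of one step at a time.
    masked_in = torch.full((1, tgt_len), MASK_ID, dtype=torch.long)
    logits = decoder(masked_in, encoder_out)  # (1, tgt_len, vocab), assumed shape
    return logits.argmax(dim=-1)              # all tokens in a single pass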
Transformer with Bidirectional Decoder for Speech Recognition
TLDR
This work introduces a bidirectional speech transformer that utilizes the different directional contexts simultaneously, and uses the introduced bidirectional beam search method to generate not only left-to-right candidates but also right-to-left candidates, determining the best hypothesis by its score.
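
The selection step of such a bidirectional beam search reduces to merging the two candidate lists and keeping the best-scored hypothesis. A minimal sketch, with an assumed (hypothesis, log-probability) candidate format:

def pick_best(l2r_candidates, r2l_candidates):
    # Merge left-to-right and right-to-left beam candidates and
    # return the hypothesis with the highest log-probability score.
    merged = l2r_candidates + r2l_candidates
    return max(merged, key=lambda cand: cand[1])[0]

# e.g. pick_best([("abc", -1.2)], [("abd", -0.9)]) returns "abd".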
Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition
TLDR
This work proposes a spike-triggered non-autoregressive transformer model for end-to-end speech recognition, which introduces a CTC module to predict the length of the target sequence and accelerate convergence.
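
The spike-triggered length prediction can be sketched as counting frames whose top non-blank CTC posterior crosses a threshold; the blank id and threshold below are assumptions, not the paper's settings:

import torch

def predict_target_length(ctc_posteriors, blank_id=0, threshold=0.5):
    # ctc_posteriors: (frames, vocab) softmax outputs of the CTC head.
    # A frame counts as a "spike" when its most probable non-blank
    # label exceeds the threshold; the number of spikes is then used
    # as the predicted target-sequence length.
    probs = ctc_posteriors.clone()
    probs[:, blank_id] = 0.0                   # discard blank mass
    spikes = probs.max(dim=-1).values > threshold
    return int(spikes.sum())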
Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition
  • Linhao Dong, Shuang Xu, Bo Xu
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
The Speech-Transformer is presented, a no-recurrence sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can be trained faster and more efficiently, together with a 2D-Attention mechanism that jointly attends to the time and frequency axes of the 2-dimensional speech inputs, thus providing more expressive representations for the Speech-Transformer.
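
As a rough, simplified reading of the 2D-Attention idea (not the paper's exact formulation, which uses convolutional projections), one can run self-attention once along each axis of the spectrogram and combine the two views:

import torch
import torch.nn.functional as F

def toy_2d_attention(spec):
    # spec: (time, freq) feature map. Attend across time steps, then
    # across frequency bins, and sum the views; a toy stand-in for
    # the paper's 2D-Attention, for intuition only.
    def self_attn(x):
        scores = x @ x.transpose(0, 1) / (x.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ x
    time_view = self_attn(spec)                                  # over time
    freq_view = self_attn(spec.transpose(0, 1)).transpose(0, 1)  # over freq
    return time_view + freq_view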
Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration
TLDR
This work integrates connectionist temporal classification (CTC) with the Transformer for joint training and decoding on automatic speech recognition (ASR) tasks, which makes training faster than with RNNs and assists LM integration.
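
The joint objective in this line of work interpolates the CTC loss with the attention decoder's cross-entropy, L = λ·L_CTC + (1−λ)·L_att. A minimal PyTorch sketch, with the weight λ (ctc_weight) and the tensor shapes as assumptions:

import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, targets, input_lens,
                             target_lens, att_logits, ctc_weight=0.3):
    # ctc_log_probs: (frames, batch, vocab) log-softmax of the CTC branch.
    # att_logits:    (batch * tgt_len, vocab) attention-decoder outputs.
    # targets:       (batch, tgt_len) label ids (padding positions left
    #                unmasked here for brevity).
    l_ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens)
    l_att = F.cross_entropy(att_logits, targets.flatten())
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att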
Speech Transformer with Speaker Aware Persistent Memory
TLDR
This paper proposes speaker-aware training for transformer-based ASR, embedding speaker knowledge into the speech transformer encoder at the utterance level through a persistent memory model, and achieves superior results compared with other models with the same objective.
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
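
The two masking policies (time warping aside) are easy to sketch; the mask widths below are illustrative, not the paper's exact LibriSpeech policy:

import torch

def spec_augment(feats, max_f=27, max_t=100):
    # feats: (time, freq) log-mel spectrogram. Zero out one random
    # frequency band and one random time span, as in SpecAugment
    # (time warping omitted; mask widths are illustrative).
    t, v = feats.shape
    out = feats.clone()
    f = int(torch.randint(0, max_f + 1, (1,)))           # band width
    f0 = int(torch.randint(0, max(v - f, 1), (1,)))      # band start
    out[:, f0:f0 + f] = 0.0
    w = int(torch.randint(0, min(max_t, t) + 1, (1,)))   # span width
    t0 = int(torch.randint(0, max(t - w, 1), (1,)))      # span start
    out[t0:t0 + w, :] = 0.0
    return out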
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, +11 authors M. Bacchiani
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced that offers improvements over the commonly used single-head attention.
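
The multi-head variant differs from single-head attention only in splitting the model dimension across several heads; PyTorch's built-in module makes the comparison a single parameter (sizes below are illustrative):

import torch
from torch import nn

x = torch.randn(1, 50, 256)  # (batch, frames, model dim)

single = nn.MultiheadAttention(embed_dim=256, num_heads=1, batch_first=True)
multi = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

out_single, _ = single(x, x, x)  # one 256-dim attention head
out_multi, _ = multi(x, x, x)    # four 64-dim heads, concatenated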
Non-Autoregressive Neural Machine Translation
TLDR
A model is introduced that avoids the autoregressive property and produces its outputs in parallel, allowing an order-of-magnitude lower latency during inference, and achieves near state-of-the-art performance on WMT 2016 English-Romanian.
Aligned Cross Entropy for Non-Autoregressive Machine Translation
TLDR
Aligned cross entropy (AXE) is proposed as an alternative loss function for training non-autoregressive models, and AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks while setting a new state of the art for non-autoregressive models.