A Study of Non-autoregressive Model for Sequence Generation

@article{Ren2020ASO,
  title={A Study of Non-autoregressive Model for Sequence Generation},
  author={Yi Ren and Jinglin Liu and Xu Tan and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.10454}
}
Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel, resulting in faster generation than their autoregressive (AR) counterparts but at the cost of lower accuracy. Different techniques, including knowledge distillation and source-target alignment, have been proposed to bridge the gap between AR and NAR models in various tasks such as neural machine translation (NMT), automatic speech recognition (ASR), and text to speech (TTS). With the help of those…
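To make the speed/accuracy trade-off concrete, the two factorizations behind these model families can be written out. This is a minimal sketch in conventional notation, where x denotes the source sequence and y = (y_1, ..., y_T) the target sequence of length T (symbols not defined in the excerpt above):

% Autoregressive (AR) factorization: each token conditions on all previously
% generated tokens y_{<t}, so inference requires T sequential decoding steps.
P_{\mathrm{AR}}(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)

% Non-autoregressive (NAR) factorization: given the source x (and, in most NAR
% models, a separately predicted target length T), the target tokens are treated
% as conditionally independent, so all T positions can be decoded in one parallel pass.
P_{\mathrm{NAR}}(y \mid x) = P(T \mid x) \prod_{t=1}^{T} P(y_t \mid x, T)

Dropping the dependence on y_{<t} is what enables parallel decoding, and it is also the source of the accuracy gap: the NAR model can no longer use earlier output tokens to disambiguate among multiple valid targets, which is the gap that techniques such as knowledge distillation and source-target alignment aim to close.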


Non-Autoregressive Sequence Generation
TLDR
This tutorial will provide a thorough introduction and review of non-autoregressive sequence generation, in four sections: Background, which covers the motivation of NAR generation, the problem definition, the evaluation protocol, and the comparison with standard autoregressive generation approaches.
A Self-Paced Mixed Distillation Method for Non-Autoregressive Generation
TLDR
A novel self-paced mixed distillation method based on AR-stream knowledge is proposed; it improves the generation quality of BANG, has no influence on its inference latency, and achieves more than 7x speedup.
A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond
TLDR
This survey conducts a systematic survey with comparisons and discussions of various non-autoregressive translation (NAT) models from different aspects, and categorizes the efforts of NAT into several groups, including data manipulation, modeling methods, training criterion, decoding algorithms, and the benefit from pre-trained models.
ORTHROS: non-autoregressive end-to-end speech translation With dual-decoder
TLDR
A novel NAR E2E-ST framework, Orthros, is proposed, in which NAR and autoregressive decoders are jointly trained on a shared speech encoder; this dramatically improves the effectiveness of a large length beam with negligible overhead.
Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade
TLDR
This work first inspects the fundamental issues of fully NAT models, adopts dependency reduction in the learning space of output tokens as the primary guidance, revisits methods in four different aspects that have been proven effective for improving NAT models, and carefully combines these techniques with necessary modifications.
FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire
TLDR
Three methods are introduced to reduce the gap between FastLR and AR models: they leverage an integrate-and-fire (I&F) module to model the correspondence between source video frames and the output text sequence, and add an auxiliary autoregressive decoder to help the encoder's feature extraction.
An Effective Non-Autoregressive Model for Spoken Language Understanding
TLDR
A novel non-autoregressive SLU model named Layered-Refine Transformer is proposed, which contains a Slot Label Generation (SLG) task and a Layered Refine Mechanism (LRM); it can efficiently obtain dependency information during training while spending no extra time in inference.
How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
TLDR
This paper shows that different types of complexity have different impacts: while reducing lexical diversity and decreasing reordering complexity both help NAR models learn better alignment between source and target, and thus improve translation quality, lexical diversity is the main reason why distillation increases model confidence, which affects the calibration of different NAR models differently.
Understanding and Improving Lexical Choice in Non-Autoregressive Translation
TLDR
This study empirically shows that, as a side effect of training non-autoregressive translation models, the lexical choice errors on low-frequency words are propagated to the NAT model from the teacher model, and proposes to expose the raw data to NAT models to restore the useful information about low-frequency words that is missed in the distilled data.
Progressive Multi-Granularity Training for Non-Autoregressive Translation
TLDR
It is empirically shown that NAT models are prone to learn fine-grained lower-mode knowledge, such as words and phrases, compared with sentences, and progressive multi-granularity training for NAT is proposed, resulting in better translation quality against strong NAT baselines.
...

References

Showing 1-10 of 34 references
Understanding Knowledge Distillation in Non-autoregressive Machine Translation
TLDR
It is found that knowledge distillation can reduce the complexity of data sets and help NAT models capture the variations in the output data, and a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality (a minimal sketch of this sequence-level distillation recipe is given after the reference list).
Fine-Tuning by Curriculum Learning for Non-Autoregressive Neural Machine Translation
TLDR
This work designs a curriculum in the fine-tuning process to progressively switch the training from autoregressive generation to non-autoregressive generation for NAT, and achieves good improvement over previous NAT baselines in terms of translation accuracy.
Hint-Based Training for Non-Autoregressive Machine Translation
TLDR
A novel approach leveraging hints from hidden states and word alignments to help the training of NART models achieves significant improvement over previous NART models on the WMT14 En-De and De-En datasets, and is even comparable to a strong LSTM-based ART baseline while being one order of magnitude faster in inference.
Non-Autoregressive Neural Machine Translation
TLDR
A model is introduced that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference, and achieves near-state-of-the-art performance on WMT 2016 English-Romanian.
A Comparative Study on Transformer vs RNN in Speech Applications
TLDR
An emergent sequence-to-sequence model called Transformer achieves state-of-the-art performance in neural machine translation and other natural language processing applications; this comparative study reports the surprising superiority of Transformer over RNN in 13 of 15 ASR benchmarks.
Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input
TLDR
This paper proposes two methods to enhance the decoder inputs so as to improve NAT models: one directly leverages a phrase table generated by conventional SMT approaches to translate source tokens to target tokens, and the other transforms source-side word embeddings to target-side words through sentence-level alignment and word-level adversary learning.
Non-Autoregressive Machine Translation with Auxiliary Regularization
TLDR
This paper proposes to address the issues of repeated translations and incomplete translations in NAT models by improving the quality of decoder hidden representations via two auxiliary regularization terms in the training process of an NAT model.
Neural Speech Synthesis with Transformer Network
TLDR
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron2, and achieves state-of-the-art performance and close-to-human quality.
Non-Autoregressive Transformer Automatic Speech Recognition
TLDR
A novel non-autoregressive transformer structure for speech recognition, originally introduced in machine translation, is studied; it can support different decoding strategies, including traditional left-to-right decoding.
Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition
TLDR
Results on Mandarin (Aishell) and Japanese ASR benchmarks show that such a non-autoregressive network can be trained for ASR, and that it matches the performance of the state-of-the-art autoregressive transformer with a 7x speedup.
...
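The reference "Understanding Knowledge Distillation in Non-autoregressive Machine Translation" above describes the sequence-level distillation recipe that most of the listed NAR papers rely on. The sketch below only illustrates that general recipe; the object and method names (ar_teacher.translate, nar_student.train_step) are hypothetical placeholders, not the API of any particular toolkit or of the paper itself.

def build_distilled_corpus(ar_teacher, source_sentences):
    """Re-label each source sentence with the AR teacher's own output.

    The distilled targets are less diverse than the raw references, which is
    what makes them easier for a conditionally independent NAR student to fit.
    """
    distilled = []
    for src in source_sentences:
        teacher_hyp = ar_teacher.translate(src)  # sequential (slow) AR decoding
        distilled.append((src, teacher_hyp))
    return distilled


def train_nar_student(nar_student, distilled_corpus, epochs=10):
    """Train the NAR student on (source, teacher-output) pairs only."""
    for _ in range(epochs):
        for src, tgt in distilled_corpus:
            # All target positions are predicted in one parallel pass;
            # the loss is an ordinary per-position cross-entropy.
            nar_student.train_step(src, tgt)
    return nar_student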