Corpus ID: 235613631

Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation

Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah A. Smith
Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work… 
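The contrast the abstract draws between sequential and parallel generation can be illustrated with a toy sketch. This is not the paper's model: `TOY_TRANSLATION`, `toy_step`, and `toy_parallel` are hypothetical stand-ins for a trained decoder, and the only point is that the autoregressive loop needs one decoder call per target token, while the non-autoregressive variant fills every target position in a single call.

```python
from typing import Callable, List, Tuple

# Hypothetical lookup table standing in for a trained translation model.
TOY_TRANSLATION: List[str] = ["das", "ist", "gut", "<eos>"]


def autoregressive_decode(
    step_fn: Callable[[List[str]], str], max_len: int
) -> Tuple[List[str], int]:
    """Generate one token per call; each call conditions on the prefix so far."""
    prefix: List[str] = []
    calls = 0
    for _ in range(max_len):
        token = step_fn(prefix)  # depends on all previously generated tokens
        calls += 1
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix, calls


def non_autoregressive_decode(
    parallel_fn: Callable[[int], List[str]], target_len: int
) -> Tuple[List[str], int]:
    """Predict all target positions in one call, independently of each other."""
    return parallel_fn(target_len), 1


def toy_step(prefix: List[str]) -> str:
    # Assumption: the next token is simply looked up by position.
    return TOY_TRANSLATION[len(prefix)]


def toy_parallel(target_len: int) -> List[str]:
    # Assumption: the target length is known in advance.
    return [TOY_TRANSLATION[i] for i in range(target_len)]


ar_tokens, ar_calls = autoregressive_decode(toy_step, max_len=10)
nar_tokens, nar_calls = non_autoregressive_decode(
    toy_parallel, target_len=len(TOY_TRANSLATION)
)
print(ar_tokens, ar_calls)
print(nar_tokens, nar_calls)
```

In a real system each call is a full decoder forward pass, so the number of calls is only a rough proxy for latency; the depth of the decoder also determines how expensive each call is.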

Finetuning Pretrained Transformers into RNNs

This work proposes a swap-then-finetune procedure, which takes an off-the-shelf pretrained transformer, replaces its softmax attention with a linear-complexity recurrent alternative, and then finetunes the model; the procedure provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants.

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

This paper introduces a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion and paves the way for highly performant token-free models that are trained completely end-to-end.

Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation

Order-agnostic cross entropy (OAXE) improves the standard cross-entropy loss to ameliorate the effect of word reordering, which is a common source of the critical multimodality problem in NAT models.

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers

This work shows that translation already happens progressively in encoder layers and even in the input embeddings, and suggests a Transformer configuration change that can increase speed by up to a factor of 2.3 with small gains in translation quality, while an 18-4 deep encoder configuration boosts translation quality by +1.4.

Diversifying Dialog Generation via Adaptive Label Smoothing

An Adaptive Label Smoothing (AdaLabel) approach that can adaptively estimate a target label distribution at each time step for different contexts, which outperforms various competitive baselines in producing diverse responses.

Non-Autoregressive Neural Machine Translation: A Call for Clarity

This work revisits several techniques that have been proposed for improving non-autoregressive translation models, compares their combined translation quality and speed implications under third-party testing environments, and provides novel insights for establishing strong baselines using length prediction or CTC-based architecture variants.

Transfer Learning with Shallow Decoders: BSC at WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task

The participation of the BSC team in the WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task aims to solve Subtask 2, Wikipedia cultural heritage articles, which involves translation in four Romance languages: Catalan, Italian, Occitan and Romanian.


This work presents a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their heuristic memory-organizing functions with a learned, contextualized one that significantly improves the inference time and space efficiency with no or negligible accuracy loss.

ABC: Attention with Bounded-memory Control

This work shows that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and it outperforms previous efficient attention models; compared to the strong transformer baselines, it significantly improves the inference time and space efficiency with no or negligible accuracy loss.

Scaling Laws for Neural Machine Translation

A formula is proposed which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and it is shown that it gives accurate predictions under a variety of scaling approaches and languages.



Non-Autoregressive Machine Translation with Latent Alignments

This paper investigates two latent alignment models for non-autoregressive machine translation, namely CTC and Imputer. CTC generates outputs in a single step, makes strong conditional independence

Guiding Non-Autoregressive Neural Machine Translation Decoding with Reordering Information

A novel NAT framework, ReorderNAT, is proposed, which explicitly models reordering information to guide NAT decoding; it achieves better performance than most existing NAT models, and even achieves translation quality comparable to autoregressive translation models with a significant speedup.

Mask-Predict: Parallel Decoding of Conditional Masked Language Models

This model improves state-of-the-art performance levels for non-autoregressive and parallel decoding translation models by over 4 BLEU on average, and is able to reach within about 1 BLEU point of a typical left-to-right transformer model, while decoding significantly faster.

A Call for Clarity in Reporting BLEU Scores

Pointing to the success of the parsing community, it is suggested machine translation researchers settle upon the BLEU scheme, which does not allow for user-supplied reference processing, and provide a new tool, SACREBLEU, to facilitate this.

Non-autoregressive Machine Translation with Disentangled Context Transformer

An attention-masking based model, called the Disentangled Context (DisCo) transformer, simultaneously generates all tokens given different contexts and achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.

From Research to Production and Back: Ludicrously Fast Neural Machine Translation

Improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation is proposed for Transformer-based student models, and, for efficient CPU-based decoding, pre-packed 8-bit matrix products and improved batched decoding are proposed.

Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference using a Delta Posterior

Inspired by recent refinement-based approaches, LaNMT is proposed, a latent-variable non-autoregressive model with continuous latent variables and a deterministic inference procedure that closes the performance gap between non-autoregressive and autoregressive approaches on the ASPEC Ja-En dataset with 8.6x faster decoding.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Sequence-Level Knowledge Distillation

It is demonstrated that standard knowledge distillation applied to word-level prediction can be effective for NMT, and two novel sequence-level versions of knowledge distillation are introduced that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search.

Distilling the Knowledge in a Neural Network

This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.