Corpus ID: 210698654

Parallel Machine Translation with Disentangled Context Transformer

Jungo Kasai, James Cross, Marjan Ghazvininejad, Jiatao Gu
State-of-the-art neural machine translation models generate a translation from left to right, with every step conditioned on the previously generated tokens. The sequential nature of this generation process causes fundamental latency at inference time, since multiple tokens of a sentence cannot be generated in parallel. We propose an attention-masking-based model, called the Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts. The DisCo…
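
The core idea — every position predicted simultaneously, each from its own observed subset of the other positions — can be sketched as a toy simulation. This is not the paper's implementation; `disco_contexts`, `parallel_step`, and the copy-from-the-left predictor are all hypothetical names used only to illustrate that no position waits on another:

```python
import random

def disco_contexts(length, seed=0):
    """For each target position, sample a random subset of the *other*
    positions to serve as its observed context (a toy stand-in for the
    per-token attention masks in a DisCo-style transformer)."""
    rng = random.Random(seed)
    contexts = []
    for i in range(length):
        others = [j for j in range(length) if j != i]
        k = rng.randint(0, len(others))
        contexts.append(sorted(rng.sample(others, k)))
    return contexts

def parallel_step(reference, contexts, predict):
    """One simultaneous prediction pass: every position is filled in from
    its own context, so all positions can be computed in parallel."""
    return [predict(i, {j: reference[j] for j in contexts[i]})
            for i in range(len(reference))]
```

A trivial predictor that copies the token to its left when that token is observed (and emits a placeholder otherwise) is enough to exercise the interface; a real model would replace it with a learned network.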


Non-Autoregressive Machine Translation with Latent Alignments

This paper investigates two latent alignment models for non-autoregressive machine translation, namely CTC and Imputer. CTC generates outputs in a single step and makes strong conditional independence assumptions between output tokens.
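
The collapse step that turns CTC's frame-level output into a token sequence (merge consecutive repeats, then drop blanks) is simple enough to sketch; this is a generic CTC collapse, not code from the paper:

```python
def ctc_collapse(frames, blank="_"):
    """Collapse a frame-level CTC output into a token sequence:
    consecutive repeats are merged, then blank symbols are dropped."""
    out = []
    prev = None
    for f in frames:
        # Emit a token only when it differs from the previous frame
        # (repeat merging) and is not the blank symbol.
        if f != prev and f != blank:
            out.append(f)
        prev = f
    return out
```

For example, the frame sequence `h h _ e e _ l l _ l o` collapses to `hello`; the blank between the two `l` runs is what lets CTC emit a doubled letter.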

A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond

This paper conducts a systematic survey, with comparisons and discussions, of various non-autoregressive translation (NAT) models from different aspects, and categorizes NAT efforts into several groups: data manipulation, modeling methods, training criteria, decoding algorithms, and benefits from pre-trained models.

Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation

The findings suggest that the latency disadvantage for autoregressive translation has been overestimated due to a suboptimal choice of layer allocation, and a new speed-quality baseline for future research toward fast, accurate translation is provided.

Memory Transformer

This work proposes and studies two extensions of the Transformer baseline: adding memory tokens to store non-local representations, and creating a memory bottleneck for global information. These memory-augmented Transformers are evaluated on a machine translation task, demonstrating that memory size positively correlates with model performance.

POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training

POINTER (PrOgressive INsertion-based TransformER) is a simple yet novel insertion-based approach for hard-constrained text generation that achieves state-of-the-art performance on constrained text generation.

Progressive Multi-Granularity Training for Non-Autoregressive Translation

It is empirically shown that NAT models are prone to learning fine-grained lower-mode knowledge, such as words and phrases, rather than sentences; progressive multi-granularity training for NAT is therefore proposed, resulting in better translation quality against strong NAT baselines.

Pointer: Constrained Text Generation via Insertion-based Generative Pre-training

POINTER, a simple yet novel insertion-based approach for hard-constrained text generation, operates by progressively inserting new tokens between existing tokens in a parallel manner, and achieves state-of-the-art performance on constrained text generation.

Context-Aware Cross-Attention for Non-Autoregressive Translation

Experimental results show that this approach can consistently improve translation quality over strong NAT baselines and extensive analyses demonstrate that the enhanced cross-attention achieves better exploitation of source contexts by leveraging both local and global information.

Understanding and Improving Lexical Choice in Non-Autoregressive Translation

This study empirically shows that, as a side effect of training non-autoregressive translation models, lexical choice errors on low-frequency words are propagated to the NAT model from the teacher model, and proposes to expose the raw data to NAT models to restore the useful information about low-frequency words that is missed in the distilled data.

Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation

This work directly exposes raw data to NAT by leveraging pretraining to rejuvenate more alignments for low-frequency target words, and combines these complementary approaches into a new training strategy for further boosting NAT performance.



Non-Autoregressive Neural Machine Translation

A model is introduced that avoids the autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference while achieving near-state-of-the-art performance on WMT 2016 English-Romanian.

Levenshtein Transformer

Levenshtein Transformer, a new partially autoregressive model devised for more flexible and amenable sequence generation, is developed along with a set of dedicated training techniques that effectively exploit insertion and deletion as each other's learning signal, thanks to their complementary nature.

Non-Autoregressive Machine Translation with Auxiliary Regularization

This paper proposes to address the issues of repeated translations and incomplete translations in NAT models by improving the quality of decoder hidden representations via two auxiliary regularization terms in the training process of an NAT model.

Guiding Non-Autoregressive Neural Machine Translation Decoding with Reordering Information

A novel NAT framework, ReorderNAT, is proposed which explicitly models reordering information to guide NAT decoding; it achieves better performance than most existing NAT models, and even comparable translation quality to autoregressive models, with a significant speedup.

Mask-Predict: Parallel Decoding of Conditional Masked Language Models

This model improves state-of-the-art performance for non-autoregressive and parallel-decoding translation models by over 4 BLEU on average, and reaches within about 1 BLEU point of a typical left-to-right transformer model while decoding significantly faster.
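
The decoding loop behind mask-predict — re-mask the lowest-confidence tokens, let the model re-fill them in parallel, repeat with a shrinking mask budget — can be sketched as follows. The `repredict` callback stands in for the conditional masked language model; the linear mask-decay schedule and all names here are illustrative, not the paper's exact code:

```python
def mask_predict(init_tokens, init_scores, repredict, iterations=3):
    """Iterative mask-predict decoding sketch: at iteration t, re-mask the
    n lowest-confidence tokens (n shrinks linearly over iterations) and
    let the model re-predict them in parallel, keeping the rest fixed."""
    tokens, scores = list(init_tokens), list(init_scores)
    length = len(tokens)
    for t in range(1, iterations):
        # Number of tokens to re-mask decays toward 1 as t grows.
        n = max(1, int(length * (iterations - t) / iterations))
        worst = sorted(range(length), key=lambda i: scores[i])[:n]
        for i in worst:
            tokens[i] = "<mask>"
        tokens, scores = repredict(tokens, scores)
    return tokens
```

With a dummy `repredict` that fills every `<mask>` and assigns it full confidence, the loop terminates with no masks left — the real gain comes from the model revising its least confident choices with both-sided context at each pass.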

Minimizing the Bag-of-Ngrams Difference for Non-Autoregressive Neural Machine Translation

This paper proposes to train NAT to minimize the Bag-of-Ngrams (BoN) difference between the model output and the reference sentence, and shows that this approach largely outperforms the NAT baseline on three translation tasks.
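
The quantity being minimized is easy to state concretely: the distance between the multisets of n-grams in the hypothesis and the reference. A minimal sketch using the L1 distance (the paper works with a differentiable approximation of this; `bon_difference` is an illustrative name):

```python
from collections import Counter

def bag_of_ngrams(tokens, n):
    """Multiset of n-grams occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bon_difference(hyp, ref, n=2):
    """L1 distance between the bag-of-n-grams of a hypothesis and a
    reference: identical sequences score 0, and each unmatched n-gram
    occurrence adds 1."""
    h, r = bag_of_ngrams(hyp, n), bag_of_ngrams(ref, n)
    return sum(abs(h[g] - r[g]) for g in set(h) | set(r))
```

For bigrams, `["a", "b", "c"]` against `["a", "b", "b"]` differs by 2: the shared `(a, b)` matches, while `(b, c)` and `(b, b)` each count once.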

Fast Structured Decoding for Sequence Models

This work designs an efficient approximation of Conditional Random Fields (CRFs) for non-autoregressive sequence models and proposes a dynamic transition technique to model positional contexts in the CRF, showing that, while adding little latency, the model achieves significantly better translation performance than previous non-autoregressive models on several translation datasets.

Hint-Based Training for Non-Autoregressive Machine Translation

A novel approach that leverages hints from hidden states and word alignments to help the training of NART models achieves significant improvements over previous NART models on the WMT14 En-De and De-En datasets, and is even comparable to a strong LSTM-based ART baseline while being one order of magnitude faster at inference.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
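
The pre-training corruption behind this bidirectional conditioning can be sketched as a toy masked-LM step: hide a random subset of tokens and ask the model to recover each one from the context on both sides. This simplification omits BERT's 80/10/10 replacement scheme, and `mask_for_mlm` is an illustrative name, not an API:

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Toy masked-LM corruption in the spirit of BERT's pre-training:
    each token is independently replaced by a mask symbol with
    probability mask_rate; targets record what must be recovered."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(mask_token)
            targets.append(tok)   # model must predict this token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return inputs, targets
```

Because the masked positions are scattered, every prediction can attend to tokens on both its left and right — the property the summary above highlights.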

End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification

This work presents a novel non-autoregressive architecture based on connectionist temporal classification and evaluates it on neural machine translation, achieving a significant speedup over autoregressive models while keeping translation quality comparable to other non-autoregressive models.