Efficient Inference For Neural Machine Translation

Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, Ilya Chatsviorkin
Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without sacrificing translation quality. We conduct an empirical study that stacks various approaches and demonstrates that a combination of replacing decoder self-attention with simplified recurrent units, adopting a deep encoder and a shallow decoder architecture, and …

LightSeq: A High Performance Inference Library for Transformers
A highly efficient inference library for models in the Transformer family that includes a series of GPU optimization techniques to both streamline the computation of Transformer layers and reduce memory footprint.
Bag of Tricks for Optimizing Transformer Efficiency
  • Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu
  • Computer Science
  • 2021
Improving Transformer efficiency has become increasingly attractive recently. A wide range of methods has been proposed, e.g., pruning, quantization, new architectures, etc. But these methods are…
Efficient Inference for Multilingual Neural Machine Translation
Multilingual NMT has become an attractive solution for MT deployment in production. But to match bilingual quality, it comes at the cost of larger and slower models. In this work, we consider several…
When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
SRU++, a recurrent unit with optional built-in attention, is presented; it exhibits state-of-the-art modeling capacity and training efficiency, reaffirming that attention is not all we need and can be complementary to other sequential modeling modules.


Deep architectures for Neural Machine Translation
This work describes and evaluates several existing approaches to introducing depth in neural machine translation, and introduces a novel "BiDeep" RNN architecture that combines deep transition RNNs and stacked RNNs.
Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation
The findings suggest that the latency disadvantage for autoregressive translation has been overestimated due to a suboptimal choice of layer allocation, and a new speed-quality baseline for future research toward fast, accurate translation is provided.
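The layer-allocation argument above can be made concrete with a toy latency model (an illustrative sketch, not the paper's cost model): the encoder runs once over the source, while the decoder runs once per generated token, so moving layers from the decoder to the encoder shrinks the dominant per-token term.

```python
def decode_cost(enc_layers, dec_layers, tgt_len, layer_cost=1.0):
    """Toy latency model for autoregressive translation.

    The encoder is applied once to the source sentence; the decoder is
    applied once per generated target token. A deep-encoder/shallow-decoder
    split (e.g. 12-1 instead of 6-6) therefore cuts the per-token cost.
    """
    return enc_layers * layer_cost + tgt_len * dec_layers * layer_cost

# Under this crude model, 12-1 is far cheaper than 6-6 for a 20-token output:
balanced = decode_cost(enc_layers=6, dec_layers=6, tgt_len=20)   # 6 + 120
deep_shallow = decode_cost(enc_layers=12, dec_layers=1, tgt_len=20)  # 12 + 20
```

The constant `layer_cost` and the linear form are assumptions for illustration only; real latency also depends on batch size, caching, and hardware.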
Learning Deep Transformer Models for Machine Translation
It is claimed that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next.
Incorporating BERT into Neural Machine Translation
A new algorithm named BERT-fused model is proposed, in which BERT is first used to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.
Simple Recurrent Units for Highly Parallelizable Recurrence
The Simple Recurrent Unit (SRU) is proposed: a light recurrent unit that balances model capacity and scalability, designed to provide expressive recurrence, enable a highly parallelized implementation, and come with careful initialization to facilitate the training of deep models.
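The parallelization claim can be illustrated with a minimal NumPy sketch of a single SRU layer (an assumption-laden toy, not the authors' implementation; weight names are placeholders): all matrix multiplications are independent of the recurrent state and can be batched over time, leaving only a cheap elementwise loop.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_forward(x, W, Wf, Wr, vf, vr, bf, br):
    """Toy forward pass of one SRU layer over a sequence x of shape (T, d).

    The expensive products x @ W.T, x @ Wf.T, x @ Wr.T have no recurrent
    dependency, so they are computed for all time steps at once; only the
    elementwise state update below is sequential.
    """
    T, d = x.shape
    U, Uf, Ur = x @ W.T, x @ Wf.T, x @ Wr.T  # batched over time
    c = np.zeros(d)
    h = np.zeros((T, d))
    for t in range(T):
        f = sigmoid(Uf[t] + vf * c + bf)   # forget gate (uses c_{t-1})
        r = sigmoid(Ur[t] + vr * c + br)   # reset gate (uses c_{t-1})
        c = f * c + (1.0 - f) * U[t]       # cell state update
        h[t] = r * c + (1.0 - r) * x[t]    # highway connection to the input
    return h
```

Because the recurrence is purely elementwise, the sequential part is far cheaper than in an LSTM, which is what makes SRU attractive as a decoder self-attention replacement.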
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned
It is found that the most important and confident heads play consistent and often linguistically interpretable roles. When pruning heads with a method based on stochastic gates and a differentiable relaxation of the L0 penalty, specialized heads are observed to be the last to be pruned.
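The gating mechanism can be sketched at inference time (a simplified illustration, not the paper's training procedure: the actual gates are learned stochastically with a hard-concrete L0 relaxation, whereas here they are simply given): each head's output is scaled by a scalar gate, and heads whose gate has collapsed toward zero contribute nothing and can be dropped.

```python
import numpy as np

def gate_heads(head_outputs, gates, threshold=0.01):
    """Scale each attention head's output by a scalar gate.

    head_outputs: array of shape (n_heads, T, d_head).
    gates: array of shape (n_heads,); values near 0 mean "pruned".
    Returns the gated outputs and the effective mask applied.
    """
    mask = np.where(gates >= threshold, gates, 0.0)
    return head_outputs * mask[:, None, None], mask
```

At deployment, fully zeroed heads can be removed from the weight matrices entirely, which is where the actual speedup comes from.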
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
Understanding Back-Translation at Scale
This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences, finding that in all but resource-poor settings, back-translations obtained via sampling or noised beam outputs are most effective.
Accelerating Neural Transformer via an Average Attention Network
The proposed average attention network replaces the original target-side self-attention in the decoder of the neural Transformer, enabling it to decode sentences over four times faster than the original version with almost no loss in training time or translation performance.
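The core of the average attention idea is easy to sketch (a minimal NumPy illustration of only the averaging step; the full network also adds a feed-forward layer and gating, omitted here): decoder self-attention over the prefix is replaced by a cumulative average g_t = mean(y_1, …, y_t), which at decode time needs only a running sum per step instead of attention over the whole prefix.

```python
import numpy as np

def average_attention(y):
    """Cumulative-average replacement for decoder self-attention.

    y: array of shape (T, d) of decoder inputs. Returns g of shape (T, d)
    where g[t] is the mean of y[0..t]. During incremental decoding this is
    O(1) per step: keep a running sum and divide by the step index.
    """
    T = y.shape[0]
    cum = np.cumsum(y, axis=0)
    return cum / np.arange(1, T + 1)[:, None]
```

Because each step depends only on a running sum, the per-token cost no longer grows with the prefix length, which is the source of the reported decoding speedup.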
From Research to Production and Back: Ludicrously Fast Neural Machine Translation
Improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation is proposed for Transformer-based student models; for efficient CPU-based decoding, pre-packed 8-bit matrix products and improved batched decoding are proposed.