Marian: Cost-effective High-Quality Neural Machine Translation in C++

Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu T. Hoang, Roman Grundkiewicz, Anthony Aue

This paper describes the submissions of the “Marian” team to the WNMT 2018 shared task. We investigate combinations of teacher-student training, low-precision matrix products, auto-tuning and other methods to optimize the Transformer model on GPU and CPU. By further integrating these methods with the new averaging attention networks, a recently introduced faster Transformer variant, we create a number of high-quality, high-performance models on the GPU and CPU, dominating the Pareto frontier… 

From Research to Production and Back: Ludicrously Fast Neural Machine Translation

Improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation is proposed for Transformer-based student models; for efficient CPU-based decoding, pre-packed 8-bit matrix products and improved batched decoding are proposed.

From Research to Production and Back: Ludicrously Fast Neural Machine Translation

This work improves teacher-student training via multi-agent dual-learning and noisy backward-forward translation for Transformer-based student models, and pushes the Pareto frontier established during the 2018 edition towards 24x (CPU) and 14x (GPU) faster models at comparable or higher BLEU values.

Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task

In the shared task, most of the submissions were Pareto optimal with respect to the trade-off between time and quality.

Speed-optimized, Compact Student Models that Distill Knowledge from a Larger Teacher Model: the UEDIN-CUNI Submission to the WMT 2020 News Translation Task

This work describes the joint submission of the University of Edinburgh and Charles University, Prague, to the Czech/English track in the WMT 2020 Shared Task on News Translation. The submission achieves translation speeds of over 700 whitespace-delimited source words per second on a single CPU thread, thus making neural translation feasible on consumer hardware without a GPU.

Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation

Applying the lottery ticket hypothesis to prune heads in the early stages of training shows that it is possible to remove up to three-quarters of attention heads from transformer-big during early training with an average change of -0.1 BLEU for Turkish→English.

The Highs and Lows of Simple Lexical Domain Adaptation Approaches for Neural Machine Translation

Two approaches are adopted to alleviate the vulnerability of machine translation systems to domain mismatch: lexical shortlisting restricted by IBM statistical alignments, and hypothesis reranking based on similarity.

Q8BERT: Quantized 8Bit BERT

This work shows how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by 4x with minimal accuracy loss; the resulting quantized model can accelerate inference if it is optimized for hardware that supports 8-bit integer operations.

Few-shot learning through contextual data augmentation

A data augmentation approach is extended using a pretrained language model to create training examples with similar contexts for novel words to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples.

Cheat Codes to Quantify Missing Source Information in Neural Machine Translation

A method is described to quantify the amount of information added by the target sentence t that is not present in the source s in a neural machine translation system; the target sentence is provided to the model in a highly compressed form (a “cheat code”).

Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU, while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.



Marian: Fast Neural Machine Translation in C++

Marian is an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs that can achieve high training and translation speed.

Sockeye: A Toolkit for Neural Machine Translation

This paper highlights Sockeye's features and benchmarks it against other NMT toolkits on two language arcs from the 2017 Conference on Machine Translation (WMT), English-German and Latvian-English, and reports competitive BLEU scores across all three architectures.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed; it is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions

It is demonstrated that current neural machine translation could already be used for in-production systems when comparing words-per-second ratios, and aspects of translation speed are investigated, introducing AmuNMT, the authors' efficient neural machine translation decoder.

Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU

This work proposes a simple but powerful network architecture which uses an RNN (GRU/LSTM) layer at bottom, followed by a series of stacked fully-connected layers applied at every timestep, which achieves similar accuracy to a deep recurrent model, at a small fraction of the training and decoding cost.

Findings of the Second Workshop on Neural Machine Translation and Generation

The results of the workshop’s shared task on efficient neural machine translation are described, where participants were tasked with creating MT systems that are both accurate and efficient.

Nematus: a Toolkit for Neural Machine Translation

Nematus is a toolkit for Neural Machine Translation that prioritizes high translation accuracy, usability, and extensibility and was used to build top-performing submissions to shared translation tasks at WMT and IWSLT.

A Simple, Fast, and Effective Reparameterization of IBM Model 2

We present a simple log-linear reparameterization of IBM Model 2 that overcomes problems arising from Model 1’s strong assumptions and Model 2’s overparameterization. Efficient inference, likelihood

Accelerating Neural Transformer via an Average Attention Network

The proposed average attention network is applied on the decoder part of the neural Transformer to replace the original target-side self-attention model and enables the neural Transformer to decode sentences over four times faster than its original version with almost no loss in training time and translation performance.

The University of Edinburgh’s Neural MT Systems for WMT17

The University of Edinburgh's submissions to the WMT17 shared news translation and biomedical translation tasks are described; novelties this year include the use of deep architectures, layer normalization, and more compact models due to weight tying and improvements in BPE segmentations.