• Corpus ID: 224814118

Beyond English-Centric Multilingual Machine Translation

@article{Fan2020BeyondEM,
  title={Beyond English-Centric Multilingual Machine Translation},
  author={Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Çelebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Edouard Grave and Michael Auli and Armand Joulin},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.11125}
}
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate… 
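The released many-to-many model can be queried directly for non-English directions. Below is a minimal sketch, assuming the publicly released facebook/m2m100_418M checkpoint and the Hugging Face transformers library; neither is described on this page, and the checkpoint name and API are used purely for illustration.

# Minimal sketch: direct Hindi -> French translation with an M2M-100
# checkpoint, assuming Hugging Face `transformers` and downloadable
# facebook/m2m100_418M weights.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "hi"                      # source language code
encoded = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")

# Forcing the decoder to start with the French language token yields a
# direct hi -> fr translation, with no English pivot.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("fr")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))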
Back-translation for Large-Scale Multilingual Machine Translation
TLDR
Surprisingly, smaller vocabularies perform better, and extensive monolingual English data offers a modest improvement in multilingual translation performance.
Contrastive Learning for Many-to-many Multilingual Neural Machine Translation
TLDR
This work proposes mCOLT, a training method to obtain a single unified multilingual translation model, empowered by a contrastive learning scheme that closes the gap among representations of different languages, and by data augmentation on both parallel and monolingual data to further align token representations.
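A schematic of the contrastive alignment idea above, written as an InfoNCE-style loss over pooled encoder representations of parallel sentence pairs; this is a sketch of the general technique, not the exact mCOLT formulation.

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(src_repr, tgt_repr, temperature=0.1):
    # src_repr, tgt_repr: (batch, dim) pooled encoder states of parallel
    # sentences; row i of each tensor is a translation pair (positive) and
    # the other rows in the batch serve as negatives.
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.t() / temperature        # pairwise cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits, labels)

# Example with random stand-in representations.
print(contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512)).item())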
Adapting Multilingual Models for Code-Mixed Translation using Back-to-Back Translation
  • 2021
In this paper, we explore the problem of translating code-mixed sentences to an equivalent monolingual form. The scarcity of gold-standard code-mixed to pure-language parallel data makes it difficult…
Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation
TLDR
A continual pre-training (CPT) framework on mBART effectively adapts it to unseen languages and consistently improves finetuning performance over the mBART baseline, as well as other strong baselines, across all tested low-resource translation pairs containing unseen languages.
Multilingual natural language processing: Towards universal translation with neural approaches
Main text: The quality of machine translation is approaching human parity for several language pairs [10]. The field is progressing rapidly with recent advances in natural language processing…
Multilingual Translation from Denoising Pre-Training
TLDR
It is found that multilingual finetuning significantly improves over multilingual models trained from scratch for zero-shot translation on non-English directions, and the ML50 benchmark is created to facilitate reproducible research by standardizing training and evaluation data.
XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders
TLDR
This work presents XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer encoder, finetunes it with multilingual parallel data, and explains its effectiveness for machine translation.
MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation
TLDR
This paper presents MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá–English (yo–en) language pair with standardized train-test splits for benchmarking, and provides several neural MT (NMT) benchmarks on this dataset, showing that, in almost all cases, the simple benchmarks outperform the pre-trained MT models.
The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
TLDR
The FLORES-101 evaluation benchmark is introduced, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of different topics and domains that enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems.
Towards Making the Most of Multilingual Pretraining for Zero-Shot Neural Machine Translation
TLDR
SixT++ is a strong many-to-English NMT model that supports 100 source languages but is trained once with a parallel dataset from only six source languages; it outperforms all current state-of-the-art unsupervised methods on Nepali and Sinhala, for both translating into and from English.

References

Showing 1-10 of 141 references
Multilingual Neural Machine Translation for Low Resource Languages
TLDR
This work shows how multilingual NMT can help tackle the challenges of low-resource language translation, and introduces the recently proposed iterative self-training method, which incrementally improves a multilingual NMT model on a zero-shot direction by relying only on monolingual data.
Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
TLDR
It is argued that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and this bottleneck is overcome via language-specific components and deeper NMT architectures.
Massively Multilingual Neural Machine Translation
TLDR
It is shown that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages in 116 translation directions in a single model.
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
TLDR
This work shows that multilingual translation models can be created through multilingual finetuning, and demonstrates that pretrained models can be extended to incorporate additional languages without loss of performance.
Multilingual Neural Machine Translation with Language Clustering
TLDR
This work obtains embedding vectors for all languages by training a universal neural machine translation model, then develops a framework that clusters the languages into groups based on these embeddings and trains one multilingual model per cluster.
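As a toy illustration of the clustering step, the sketch below groups placeholder language-embedding vectors with k-means; the vectors, language codes, and cluster count are made up for illustration, not taken from the paper.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
languages = ["de", "fr", "es", "ru", "uk", "hi", "bn", "zh"]
# Stand-in for the language-tag embeddings learned by a universal NMT model.
lang_embeddings = rng.normal(size=(len(languages), 512))

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(lang_embeddings)
for lang, cluster_id in zip(languages, clusters):
    print(lang, "-> cluster", cluster_id)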
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
TLDR
This work sets a milestone by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples, and demonstrates effective transfer learning ability, significantly improving translation quality of low-resource languages, while keeping high-resource language translation quality on-par with competitive bilingual baselines.
A Study of Multilingual Neural Machine Translation
TLDR
This paper conducts a comprehensive study on a multilingual dataset with more than 20 languages and shows that low-resource language pairs benefit greatly from multilingual training, while rich-resource language pairs may suffer under limited model capacity, and that training with similar languages helps more than training with dissimilar languages.
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
TLDR
This work demonstrates the efficacy of monolingual data with self-supervision in multilingual NMT and offers a viable path towards adding new languages to multilingual models, getting up to 33 BLEU on ro-en translation without any parallel data or back-translation.
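A toy illustration of span-style self-supervision on monolingual text: a contiguous span is replaced by a mask token and the original sentence becomes the reconstruction target. The masking ratio and helper name are illustrative and do not reproduce the paper's exact objective.

import random

def mask_span(tokens, mask_token="<mask>", ratio=0.35, seed=0):
    # Replace one contiguous span with a single mask token; the model would
    # be trained to reconstruct the full original token sequence.
    random.seed(seed)
    span_len = max(1, int(len(tokens) * ratio))
    start = random.randrange(0, len(tokens) - span_len + 1)
    corrupted = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return corrupted, tokens

print(mask_span("multilingual models can also learn from monolingual data".split()))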
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
TLDR
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
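The artificial target-language token amounts to a one-line preprocessing step; the <2xx> tag format below is illustrative rather than the paper's exact tokens.

def tag_source(source_sentence: str, target_lang: str) -> str:
    # Prepend an artificial token telling the shared model which language to
    # produce; at inference, the same tag enables zero-shot directions.
    return f"<2{target_lang}> {source_sentence}"

print(tag_source("Hello, how are you?", "es"))   # "<2es> Hello, how are you?"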
Universal Neural Machine Translation for Extremely Low Resource Languages
TLDR
The proposed transfer-learning approach, which shares lexical and sentence-level representations across multiple source languages into one target language, achieves 23 BLEU on Romanian-English WMT2016 using a tiny parallel corpus of 6k sentences.