Corpus ID: 235829420

The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation

@inproceedings{Adelani2021TheEO,
  title={The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation},
  author={D. Adelani and Dana Ruiter and Jesujoba Oluwadara Alabi and Damilola Adebonojo and Adesina Ayeni and Mofetoluwa Adeyemi and Ayodele Awokoya and Cristina Espa{\~n}a-Bonet},
  booktitle={MTSUMMIT},
  year={2021}
}
Massively multilingual machine translation (MT) has shown impressive capabilities, including zero- and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus…
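
Since MENYO-20k is positioned as a standardized evaluation set, a typical use is corpus-level BLEU scoring of system outputs against its references. The sketch below does this with sacreBLEU; the file names hyp.en and test.en are hypothetical placeholders, not the corpus's actual distribution format.

```python
# Minimal sketch: score MT hypotheses against references with sacreBLEU.
# File names (hyp.en, test.en) are hypothetical placeholders.
import sacrebleu

with open("hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("test.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```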

Citations

Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages
The combination of multilingual denoising autoencoding, SSNMT with back-translation, and bilingual finetuning makes it possible to learn machine translation even for distant language pairs for which only small amounts of monolingual data are available.
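
As a schematic illustration of the back-translation step mentioned in the summary above (not the paper's actual pipeline), monolingual target-language sentences can be turned into synthetic parallel data with a reverse-direction model; reverse_model.translate below is a hypothetical interface.

```python
# Schematic back-translation sketch: monolingual target sentences are
# translated into the source language by a reverse model, producing
# synthetic (source, target) pairs for training the forward model.
# `reverse_model.translate` is a hypothetical interface.

def backtranslate(monolingual_target, reverse_model):
    synthetic_pairs = []
    for tgt_sentence in monolingual_target:
        # The synthetic source side comes from the target-to-source model.
        src_sentence = reverse_model.translate(tgt_sentence)
        synthetic_pairs.append((src_sentence, tgt_sentence))
    return synthetic_pairs
```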

References

Showing 1–10 of 46 references
The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
This work introduces the FLORES evaluation datasets for Nepali–English and Sinhala–English, based on sentences translated from Wikipedia, and demonstrates that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT.
Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of Yorùbá and Twi
This paper focuses on two African languages, Yorùbá and Twi, and uses different architectures that learn word representations from both surface forms and characters, exploiting all the available information, which proves important for these languages.
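
As one way to learn word representations from both surface forms and characters (a sketch, not necessarily the paper's exact setup), fastText's skip-gram model with character n-grams can be trained on raw Yorùbá text; the corpus path and hyperparameters below are illustrative.

```python
# Sketch: subword-aware embeddings with fastText character n-grams.
# "yoruba_corpus.txt" is a placeholder path, and the hyperparameters are
# illustrative defaults rather than the paper's reported configuration.
import fasttext

model = fasttext.train_unsupervised(
    "yoruba_corpus.txt",  # one sentence per line, UTF-8 (diacritics preserved)
    model="skipgram",
    dim=300,
    minn=3,   # smallest character n-gram
    maxn=6,   # largest character n-gram
)

# Character n-grams let the model build vectors even for unseen word forms.
print(model.get_word_vector("Yorùbá")[:5])
```
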
Revisiting Low-Resource Neural Machine Translation: A Case Study
It is shown that, without the use of any auxiliary monolingual or multilingual data, an optimized NMT system can outperform PBSMT with far less data than previously claimed.
Copied Monolingual Data Improves Low-Resource Neural Machine Translation
We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we…
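
The copying idea is simple enough to show schematically. Assuming sentences are held as plain Python strings (an assumption, not the authors' code), each monolingual target sentence becomes a training pair whose source side is a copy of the target side:

```python
# Sketch of the "copied monolingual data" idea: each monolingual
# target-language sentence is added to the training data as a pair whose
# source side is simply a copy of the target side.

def add_copied_monolingual(parallel_pairs, monolingual_target):
    """parallel_pairs: list of (source, target) strings;
    monolingual_target: list of target-language strings."""
    copied = [(sentence, sentence) for sentence in monolingual_target]
    return parallel_pairs + copied
```
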
Igbo-English Machine Translation: An Evaluation Benchmark
This work discusses the effort toward building a standard machine translation benchmark dataset for Igbo, one of the three major Nigerian languages and a low-resourced language for NLP research.
Beyond English-Centric Multilingual Machine Translation
This work creates a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages, and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models.
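
The resulting many-to-many model (M2M-100) was released publicly; as an illustrative sketch, the smallest checkpoint can be used for Yorùbá-to-English translation through Hugging Face Transformers. The checkpoint name and language codes follow the public release, but treat the snippet as an unverified example rather than the paper's setup.

```python
# Illustrative use of a released M2M-100 checkpoint for yo -> en translation
# via Hugging Face Transformers.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "yo"  # Yorùbá
inputs = tokenizer("Báwo ni o ṣe wà?", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("en"),  # force English output
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```
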
On Using Monolingual Corpora in Neural Machine Translation
This work investigates how to leverage abundant monolingual corpora for neural machine translation, improving results for En-Fr and En-De translation, and extends to high-resource languages such as Cs-En and De-En.
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
It is shown that margin-based bitext mining in a multilingual sentence space can be applied to monolingual corpora of billions of sentences, achieving a new state of the art for a single system on the WMT'19 test set for translation between English and German, Russian, and Chinese, as well as between German and French.
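
The margin criterion referred to above is commonly the ratio margin of Artetxe and Schwenk (2019): the cosine similarity of a candidate pair divided by the average similarity of each sentence to its k nearest neighbours. A small NumPy sketch, assuming unit-normalised sentence embeddings and precomputed neighbour lists (not the CCMatrix pipeline itself):

```python
# Sketch of ratio-margin scoring for bitext mining, assuming x and y are
# unit-normalised sentence embeddings and nn_x / nn_y hold the embeddings
# of their k nearest neighbours in the other language's index.
import numpy as np

def margin_score(x, y, nn_x, nn_y, k=4):
    cos_xy = float(np.dot(x, y))
    # Average cosine similarity to the k nearest neighbours on each side.
    avg_x = float(np.mean(nn_x[:k] @ x))
    avg_y = float(np.mean(nn_y[:k] @ y))
    return cos_xy / (0.5 * (avg_x + avg_y))
```
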
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
This work sets a milestone by building a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples, and demonstrates effective transfer learning, significantly improving translation quality for low-resource languages while keeping high-resource language translation quality on par with competitive bilingual baselines.
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
This work shows that multilingual translation models can be created through multilingual finetuning, and demonstrates that pretrained models can be extended to incorporate additional languages without loss of performance.