Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

@inproceedings{Nekoto2020ParticipatoryRF,
  title={Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages},
  author={Wilhelmina Nekoto and V. Marivate and T. Matsila and Timi E. Fasubaa and T. Kolawole and T. Fagbohungbe and S. Akinola and S. Muhammad and Salomon Kabongo KABENAMUALU and Salomey Osei and Sackey Freshia and Rubungo Andre Niyongabo and Ricky Macharm and Perez Ogayo and Orevaoghene Ahia and Musie Meressa and Mofetoluwa Adeyemi and Masabata Mokgesi-Selinga and Lawrence Okegbemi and L. Martinus and Kolawole Tajudeen and Kevin Degila and Kelechi Ogueji and Kathleen Siminyu and Julia Kreutzer and Jason Webster and Jamiil Toure Ali and Jade Abbott and Iroro Orife and Ignatius Ezeani and Idris Abdulkabir Dangana and H. Kamper and Hady ElSahar and Goodness Duru and Ghollah Kioko and Espoir Murhabazi and Elan Van Biljon and Daniel Whitenack and Christopher Onyefuluchi and Chris C. Emezue and Bonaventure F. P. Dossou and Blessing Sibanda and B. Bassey and A. Olabiyi and Arshath Ramkilowan and A. Oktem and Adewale Akinfaderin and A. Bashir},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
  year={2020}
}
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. ‘Low-resourced’-ness is a complex problem that goes beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages…

Citations

AI4D - African Language Program
TLDR
This work details the AI4D African Language Program, a three-part project that incentivised the crowd-sourcing, collection, and curation of language datasets through an online quantitative and qualitative challenge, and hosted competitive machine-learning challenges on the basis of these datasets.
Beyond English-Centric Multilingual Machine Translation
TLDR
This work creates a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages, and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models.
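For readers who want to try direct many-to-many translation, a minimal sketch using the publicly released M2M-100 checkpoint through the Hugging Face transformers API follows; the checkpoint size, language pair, and example sentence are illustrative choices, not details from the paper.

# Sketch: direct (non-English-pivoted) translation with an M2M-100 checkpoint.
# Assumes the transformers and sentencepiece packages are installed; the
# checkpoint name and language codes below are illustrative assumptions.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"  # source language: French
encoded = tokenizer("La vie est belle.", return_tensors="pt")

# Forcing the decoder to start with the Swahili language token translates
# French -> Swahili directly, without pivoting through English.
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("sw"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))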
Language Divergences and Typology
This chapter introduces machine translation (MT), the use of computers to translate from one language to another. Of course translation, in its full generality, such as the…
Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation
TLDR
Experimental results on a diverse set of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and extensively filtering it with sentence-level similarity metrics significantly improve out-of-the-box MT systems for low-resource language pairs.
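As a rough illustration of sentence-level similarity filtering, the sketch below scores each pseudo-parallel pair with multilingual sentence embeddings and keeps only high-similarity pairs; the embedding model, the cosine-similarity criterion, and the threshold are assumptions for illustration, not the paper's exact method.

# Sketch: filtering a pseudo-parallel corpus by sentence-level similarity.
# Assumes the sentence-transformers package; model name and threshold are
# illustrative, not taken from the paper.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual

def filter_pairs(src_sents, tgt_sents, threshold=0.7):
    # Keep only pairs whose cross-lingual embeddings are similar enough.
    src_emb = encoder.encode(src_sents, convert_to_tensor=True)
    tgt_emb = encoder.encode(tgt_sents, convert_to_tensor=True)
    return [
        (s, t)
        for i, (s, t) in enumerate(zip(src_sents, tgt_sents))
        if cos_sim(src_emb[i], tgt_emb[i]).item() >= threshold
    ]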
MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation
TLDR
This paper presents MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá–English (yo–en) language pair, with standardized train-test splits for benchmarking, and provides several neural MT (NMT) benchmarks on this dataset, showing that, in almost all cases, the simple benchmarks outperform the pre-trained MT models.
Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique
TLDR
The creation of the Emakhuwa–Portuguese parallel corpus is described: a collection of texts from the Jehovah’s Witnesses website and a variety of other sources, including the African Story Book website, the Universal Declaration of Human Rights, and Mozambican legal documents.
Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources
TLDR
This work proposes a simple and scalable method to improve unsupervised NMT, showing how adding comparable data mined using a bilingual dictionary, along with modest additional compute resources for training, can significantly improve the model's performance.
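One simple way to picture dictionary-based mining of comparable data is sketched below: candidate sentence pairs are kept when enough source words have a dictionary translation in the target sentence. The scoring rule and threshold are toy assumptions, not the paper's actual mining procedure.

# Sketch: mining comparable data with a bilingual dictionary (toy version).
def dictionary_overlap(src_sent, tgt_sent, bilingual_dict):
    # Fraction of source tokens whose dictionary translation appears
    # in the target sentence.
    src_tokens = src_sent.lower().split()
    tgt_tokens = set(tgt_sent.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(
        1 for tok in src_tokens
        if any(tr in tgt_tokens for tr in bilingual_dict.get(tok, ()))
    )
    return hits / len(src_tokens)

def mine_pairs(candidates, bilingual_dict, threshold=0.3):
    # Keep candidate (src, tgt) pairs above an assumed overlap threshold.
    return [(s, t) for s, t in candidates
            if dictionary_overlap(s, t, bilingual_dict) >= threshold]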
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
TLDR
GEM, a living benchmark for natural language generation (NLG), its evaluation, and metrics, is introduced, along with a description of the data for the 2021 shared task at the associated GEM Workshop.
Transformer-based Machine Translation for Low-resourced Languages embedded with Language Identification
TLDR
The development of neural machine translation (NMT) for low-resourced languages of South Africa is presented; two MT models, JoeyNMT and a Transformer NMT with self-attention, are trained and evaluated using the BLEU score.
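One common way to embed language identity in such systems is to prepend a language token to each source sentence, as in multilingual NMT; whether this paper uses exactly this scheme is not stated in the snippet, so the sketch below only illustrates the general idea, with an invented tag format and language code.

# Sketch: embedding language identity by prepending a target-language tag
# to each source sentence before training. The tag format and language
# codes are illustrative assumptions.
def tag_source(sentence: str, tgt_lang: str) -> str:
    return f"<2{tgt_lang}> {sentence}"

print(tag_source("Good morning", "zu"))  # -> "<2zu> Good morning"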
KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi
TLDR
The experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi, and the design of the created datasets allows for wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as a basis for new annotations for tasks such as parsing, POS tagging, and NER.
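A minimal sketch of this kind of transfer, assuming the fastText Python package and training files in fastText's "__label__X text" format, is shown below; all file names and hyperparameters are invented for illustration.

# Sketch: train embeddings on higher-resourced Kinyarwanda, then reuse them
# for a Kirundi classifier. File names and hyperparameters are placeholders.
import fasttext

# 1. Unsupervised word embeddings on Kinyarwanda text.
embeddings = fasttext.train_unsupervised("kinyarwanda_corpus.txt",
                                         model="skipgram", dim=100)

# 2. Export the vectors in .vec format so the classifier can load them.
with open("rw.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(embeddings.words)} 100\n")
    for w in embeddings.words:
        vec = " ".join(str(x) for x in embeddings.get_word_vector(w))
        f.write(f"{w} {vec}\n")

# 3. Train a news classifier on Kinyarwanda, initialised from those vectors,
#    and evaluate zero-shot on the closely related Kirundi test set.
clf = fasttext.train_supervised("kinnews_train.txt",
                                pretrainedVectors="rw.vec", dim=100)
n, precision, recall = clf.test("kirnews_test.txt")
print(f"Kirundi zero-shot precision@1: {precision:.3f}")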

References

Showing 1–10 of 71 references
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
TLDR
This work sets a milestone by building a single massively multilingual NMT model handling 103 languages, trained on over 25 billion examples, and demonstrates effective transfer-learning ability, significantly improving translation quality for low-resource languages while keeping high-resource translation quality on par with competitive bilingual baselines.
FFR V1.0: Fon-French Neural Machine Translation
TLDR
The creation of a large, growing corpus for Fon-to-French translation, and of the FFR v1.0 model trained on this dataset, is described: a major step towards a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use.
On Optimal Transformer Depth for Low-Resource Language Translation
TLDR
It is found that the field's current trend toward using very large models is detrimental for low-resource languages, since it makes training more difficult and hurts overall performance, confirming previous observations.
Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
TLDR
This paper examines and analyzes the challenges associated with developing and introducing language technologies to low-resource language communities, and describes essential factors on which the success of such technologies hinges.
Universal Neural Machine Translation for Extremely Low Resource Languages
TLDR
The proposed transfer-learning approach shares lexical and sentence-level representations across multiple source languages into one target language, achieving 23 BLEU on Romanian–English WMT2016 with a tiny parallel corpus of 6k sentences.
A Study of Translation Edit Rate with Targeted Human Annotation
We define a new, intuitive measure for evaluating machine translation output that avoids the knowledge intensiveness of more meaning-based approaches, and the labor-intensiveness of human judgments.
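TER counts the word-level edits (insertions, deletions, substitutions, and block shifts) needed to turn the system output into the reference, normalized by reference length. The sketch below is a simplified version that omits shifts, so it only approximates the full metric.

# Sketch: simplified Translation Edit Rate (edits / reference length).
# Real TER also counts block shifts; this toy version uses plain word-level
# edit distance (insertions, deletions, substitutions) only.
def simple_ter(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(hyp)][len(ref)] / max(len(ref), 1)

print(simple_ter("the cat sat", "the cat sat on the mat"))  # 0.5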
Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
TLDR
It is argued that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and this bottleneck is overcome via language-specific components and deepened NMT architectures.
Toward a Lightweight Solution for Less-resourced Languages: Creating a POS Tagger for Alsatian Using Voluntary Crowdsourcing
TLDR
The Bisame platform, a purpose-built and slightly gamified platform, was used to gather annotations on a variety of corpora covering some of the language's dialectal variations, enabling the training of a first tagger for Alsatian that is nearly 84% accurate.
Towards Neural Machine Translation for Edoid Languages
TLDR
This work explores the feasibility of neural machine translation (NMT) for the Edoid language family of Southern Nigeria, training and evaluating baseline translation models for four widely spoken languages in this group: Edo, Esan, Urhobo, and Isoko.
A Call for Clarity in Reporting BLEU Scores
TLDR
Pointing to the success of the parsing community, it is suggested that machine translation researchers settle upon a common BLEU scheme that does not allow for user-supplied reference processing, and a new tool, SACREBLEU, is provided to facilitate this.
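The tool itself is distributed on PyPI as sacrebleu; a minimal corpus-level usage sketch follows, with placeholder hypothesis and reference strings.

# Sketch: reproducible corpus BLEU with the sacrebleu package.
# Hypothesis/reference strings are placeholders.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one inner list per reference set

score = sacrebleu.corpus_bleu(hypotheses, references)
print(score.score)  # corpus-level BLEU on sacrebleu's standard tokenization
print(score)        # formatted result string for reporting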