Corpus ID: 232478417

Domain-specific MT for Low-resource Languages: The case of Bambara-French

Allahsera Auguste Tapo, Michael Leventhal, Sarah K. K. Luger, Christopher Michael Homan, Marcos Zampieri
Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of parallel data. In this paper, we address the issue of domain-specific MT for Bambara, an under-resourced Mande language spoken in Mali. We present the first domain-specific parallel dataset for MT of Bambara into and from French. We discuss challenges in working with small quantities of domain-specific data for a low-resource language and we present the results of machine learning…

The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
The Flores-101 evaluation benchmark is introduced. It consists of 3001 sentences extracted from English Wikipedia, covers a variety of topics and domains, and enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems.
Building Machine Translation Systems for the Next Thousand Languages
Results in three research domains are described, which include building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven language identification techniques, and developing practical MT models for under-served languages.
The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation
It is suggested that sparsity can play a beneficial role in curbing memorization of low-frequency attributes, and therefore offers a promising solution to the low-resource double bind.


Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study
Novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages are presented, and it is suggested that similar quality can be obtained from either written or spoken translations for certain kinds of texts.
Towards a dependency-annotated treebank for Bambara
A dependency annotation scheme is presented for Bambara, a Mande language spoken in Mali that has few computational linguistic resources, together with the annotation of a small treebank of 116 randomly picked sample sentences.
On Optimal Transformer Depth for Low-Resource Language Translation
It is found that the current trend in the field to use very large models is detrimental for low-resource languages, since it makes training more difficult and hurts overall performance, confirming previous observations.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
The complexity of the vocabulary of Bambara
The weak generative capacity of the vocabulary of Bambara is studied, and it is shown that the vocabulary is not context free.
BPE-Dropout: Simple and Effective Subword Regularization
BPE-dropout is introduced: a simple and effective subword regularization method based on and compatible with conventional BPE. It stochastically corrupts the segmentation procedure of BPE, which produces multiple segmentations within the same fixed BPE framework.
Joey NMT: A Minimalist NMT Toolkit for Novices
Joey NMT provides many popular NMT features in a small and simple code base, so that novices can easily and quickly learn to use it and adapt it to their needs, and achieves performance comparable to more complex toolkits on standard benchmarks.
Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara
The first parallel data set for machine translation of Bambara into and from English and French and the first benchmark results on machine translation to and from Bambara are presented.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any