A Large-Scale Study of Machine Translation in Turkic Languages

@inproceedings{mirzakhalov-etal-2021-large,
  title={A Large-Scale Study of Machine Translation in Turkic Languages},
  author={Jamshidbek Mirzakhalov and Anoop S. Babu and Duygu Ataman and Sherzod Kariev and Francis M. Tyers and Otabek Abduraufov and Mammad Hajili and Sardana Ivanova and Abror Khaytbaev and Antonio Laverghetta and Behzodbek Moydinboyev and Esra Onal and Shaxnoza Pulatova and Ahsan Wahab and Orhan Firat and Sriram Chellappan},
  booktitle={Conference on Empirical Methods in Natural Language Processing},
  year={2021}
}
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely… 


Evaluating Multiway Multilingual NMT in the Turkic Languages

It is found that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on a downstream task of a single pair yields a huge performance boost in both low- and high-resource scenarios.

Towards Effective Machine Translation For a Low-Resource Agglutinative Language: Karachay-Balkar

This thesis finds that Apertium is more conducive to a community-driven machine translation development process than JoeyNMT when evaluated on the criteria of efficiency, accessibility, ease of deployment, and interpretability.

Building Machine Translation Systems for the Next Thousand Languages

Results in three research domains are described, which include building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques, and developing practical MT models for under-served languages.

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

This work creates Glot500-m, an LLM that covers 511 predominantly low-resource languages, and shows that no single factor explains the quality of multilingual LLM representations; rather, a combination of factors determines quality, including corpus size, script, "help" from related languages, and the total capacity of the model.

Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval

Two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer to multilingual and cross-lingual retrieval tasks.

Quantifying Synthesis and Fusion and their Impact on Machine Translation

Theoretical work in morphological typology offers the possibility of measuring morphological diversity on a continuous scale. However, literature in Natural Language Processing (NLP) typically labels…

An Open Dataset and Model for Language Identification

This work presents a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages, outperforming previous work.

What a Creole Wants, What a Creole Needs

In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to…

Local Languages, Third Spaces, and other High-Resource Scenarios

The world’s language ecology includes standardised languages, local languages, and contact languages, which are often subsumed under the label of “under-resourced languages” even though they have distinct functions and prospects.

Machine translation infrastructure for Turkic languages (MT-Turk)

Although the lack of linguistic resources affected the success of the system negatively, this study led to the introduction of an extensible infrastructure that can learn from previous translations and use the suggestions of previous users for disambiguation.

The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English

This work introduces the FLORES evaluation datasets for Nepali–English and Sinhala–English, based on sentences translated from Wikipedia, and demonstrates that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT.

The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT

A new benchmark for machine translation is described that provides training and test data for thousands of language pairs covering over 500 languages, along with tools for creating state-of-the-art translation models from that collection, with the aim of triggering the development of open translation tools and models with much broader coverage of the world’s languages.

Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English

A new vocabulary reduction method for NMT is presented, based on unsupervised morphology learning, which can reduce the vocabulary of a given input corpus at any rate while also considering the morphological properties of the language.

It’s Easier to Translate out of English than into it: Measuring Neural Translation Difficulty by Cross-Mutual Information

This paper proposes cross-mutual information (XMI): an asymmetric information-theoretic metric of machine translation difficulty that exploits the probabilistic nature of most neural machine translation models.
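The abstract does not spell out the metric itself; a common formulation, reconstructed here in our own notation (so the symbols are assumptions, not the paper's), contrasts the cross-entropy of the target under a language model with its cross-entropy under the translation model:

```latex
% Cross-mutual information (sketch); q_LM and q_MT denote the language-model
% and translation-model distributions over the target sentence T given source S
\mathrm{XMI}(S \to T) \;=\; H_{q_{\mathrm{LM}}}(T) \;-\; H_{q_{\mathrm{MT}}}(T \mid S)
```

A lower conditional cross-entropy relative to the unconditional one indicates that the source made the target easier to predict, which is why the metric is asymmetric in translation direction.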

Building the Tatar-Russian NMT System Based on Re-translation of Multilingual Data

The main results are the creation of the first neural Tatar–Russian translation system and an improvement in translation quality for this language pair from 12 to 39 BLEU and from 17 to 45 BLEU in the two translation directions.

Neural Machine Translation of Rare Words with Subword Units

This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
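The subword-unit idea summarised above (byte-pair encoding, BPE) can be sketched in a few lines: starting from characters, repeatedly merge the most frequent adjacent symbol pair, so frequent words stay whole while rare words decompose into learned subwords. This is a minimal illustration of the technique, not the paper's reference implementation; the word-frequency dictionary below is made up.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict.

    Returns the ordered list of merges and the resulting segmented vocab.
    """
    # Start with each word as a tuple of single characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus (hypothetical counts): the frequent "er" ending is merged first.
merges, vocab = learn_bpe({"lower": 5, "lowest": 2, "newer": 6, "wider": 3}, 10)
```

On this toy input the first learned merge is `('e', 'r')`, since "er" is the most frequent adjacent pair; an unseen word such as "tallest" would then be segmented into known subwords plus characters rather than mapped to an unknown token.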

Findings of the LoResMT 2020 Shared Task on Zero-Shot for Low-Resource languages

The shared task experience suggests that back-translation and domain adaptation methods result in better accuracy for small-size datasets, and that, although translation between similar languages is no cakewalk, linguistically distinct languages require more data to give better results.

Application of Low-resource Machine Translation Techniques to Russian-Tatar Language Pair

This paper applies such techniques as transfer learning and semi-supervised learning to the base Transformer model and empirically shows that the resulting models improve Russian to Tatar and Tatar to Russian translation quality by +2.57 and +3.66 BLEU, respectively.

COMET: A Neural Framework for MT Evaluation

This framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality.