Corpus ID: 213176734

Utilizing Language Relatedness to improve Machine Translation: A Case Study on Languages of the Indian Subcontinent

@article{Kunchukuttan2020UtilizingLR,
  title={Utilizing Language Relatedness to improve Machine Translation: A Case Study on Languages of the Indian Subcontinent},
  author={Anoop Kunchukuttan and Pushpak Bhattacharyya},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.08925}
}
In this work, we present an extensive study of statistical machine translation involving languages of the Indian subcontinent. These languages are related by genetic and contact relationships. We describe the similarities between Indic languages arising from these relationships. We explore how lexical and orthographic similarity among these languages can be utilized to improve translation quality between Indic languages when limited parallel corpora are available. We also explore how the…
A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages
TLDR
A corpus of 600K word pairs mined from parallel translation corpora and monolingual corpora is created, which is the largest transliteration corpus for Indian languages mined from public sources, and an improved multilingual training recipe for Indic languages is proposed.
Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
TLDR
It is argued that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs, and RelateLM is proposed, which uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL) (Hindi in this case).
Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IITBombay@MultiIndicNMT WAT2021
TLDR
The use of transliteration (script conversion) for Indic languages in reducing the lexical gap for training a multilingual NMT system is demonstrated, and improvement in performance is shown by training a multilingual NMT system using languages of the same family, i.e., related languages.
IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages
TLDR
The analyses focus on identifying the impact of script unification, corpus size, and multilingualism on the final performance of IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
Neural Machine Translation in Low-Resource Setting: a Case Study in English-Marathi Pair
TLDR
Different techniques for overcoming the challenges of low-resource settings in Neural Machine Translation (NMT) are explored, focusing on the case of English-Marathi NMT, and a significant improvement trend in BLEU score is observed across various techniques.
IIIT Hyderabad Submission To WAT 2021: Efficient Multilingual NMT systems for Indian languages
TLDR
This paper describes the work and the systems submitted by the IIIT-Hyderabad team in the WAT 2021 MultiIndicMT shared task, which covers 10 major languages of the Indian subcontinent, and finds that the final multilingual system significantly outperforms the baselines.
Multilingual Machine Translation Systems at WAT 2021: One-to-Many and Many-to-One Transformer based NMT
TLDR
From the authors' experiments, it is observed that the multilingual NMT systems outperform the bilingual baseline MT systems for each of the language pairs under consideration.
Contact Relatedness can help improve multilingual NMT: Microsoft STCI-MT @ WMT20
TLDR
It is shown that utilizing contact relatedness via multilingual NMT can significantly improve translation quality for English-Tamil translation.
Alternative Input Signals Ease Transfer in Multilingual Machine Translation
Recent work in multilingual machine translation (MMT) has focused on the potential of positive transfer between languages, particularly cases where higher-resourced languages can benefit…

References

SHOWING 1-10 OF 55 REFERENCES
Shata-Anuvadak: Tackling Multiway Translation of Indian Languages
We present a compendium of 110 Statistical Machine Translation systems built from parallel corpora of 11 Indian languages belonging to both Indo-Aryan and Dravidian families. We analyze the…
Statistical Machine Translation between Related Languages
TLDR
The objective of the tutorial is to discuss how the relatedness among languages can be leveraged to bridge this language divergence thereby achieving some/all of these goals: improving translation quality, achieving better generalization, sharing linguistic resources, and reducing resource requirements.
ANGLABHARTI: a multilingual machine aided translation project on translation from English to Indian languages
TLDR
An English to Indian languages machine aided translation system, named ANGLABHARTI, has been developed, which is better than the transfer approach, but falls short of genuine interlingua, in the sense that it ignores complete disambiguation/understanding of the text to be translated.
Morphological Processing for English-Tamil Statistical Machine Translation
TLDR
This work implements suffix-separation rules for both languages of the English-Tamil pair, and evaluates the impact of this preprocessing on translation quality of the phrase-based as well as hierarchical model in terms of BLEU score and a small manual evaluation.
Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
TLDR
A collection of parallel corpora between English and six languages from the Indian subcontinent, which are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation, is built.
Brahmi-Net: A transliteration and script conversion system for languages of the Indian subcontinent
TLDR
Brahmi-Net is presented - an online system for transliteration and script conversion for all major Indian language pairs (306 pairs) and an extended ITRANS encoding for translating between English and Indic scripts.
Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation
TLDR
The approach eschews the use of parsing or other sophisticated linguistic tools for the target language (Hindi) making it a useful framework for statistical machine translation from English to Indian languages in general, since such tools are not widely available for Indian languages currently.
Character-Based PSMT for Closely Related Languages
TLDR
This paper describes a simple method to combine character-based models with standard word-based models to increase the coverage of a phrase-based SMT system and can show a modest improvement when translating between Norwegian and Swedish.
Clause Restructuring for Statistical Machine Translation
TLDR
The reordering approach is applied as a pre-processing step in both the training and decoding phases of a phrase-based statistical MT system, showing an improvement from 25.2% Bleu score for a baseline system to 26.8% Bleu score for the system with reordering.
Source Language Adaptation Approaches for Resource-Poor Machine Translation
TLDR
Three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation are proposed by adapting and using a large bitext for a related resource-rich language RICH and the same target language TGT.