
AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

@article{Kunchukuttan2020AI4BharatIndicNLPCM,
  title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
  author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.00085}
}
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will…
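To make the evaluation setup above concrete, here is a minimal sketch of how such pre-trained word embeddings could be used for news-article category classification by averaging word vectors per article. The file names, data layout (word2vec/fastText text vectors, CSV files with "text" and "label" columns) and the classifier are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: evaluate pre-trained word embeddings on a news-category
# classification task by averaging word vectors per article.
# File names and column layout are hypothetical placeholders.
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load embeddings in word2vec text format (one "word dim1 ... dimN" line per word).
vectors = KeyedVectors.load_word2vec_format("indicnlp.hi.vec", binary=False)

def doc_vector(text):
    """Average the vectors of in-vocabulary tokens; zero vector if none found."""
    toks = [t for t in text.split() if t in vectors]
    if not toks:
        return np.zeros(vectors.vector_size)
    return np.mean([vectors[t] for t in toks], axis=0)

train = pd.read_csv("hi-news-train.csv")   # assumed columns: "text", "label"
test = pd.read_csv("hi-news-test.csv")

X_train = np.vstack([doc_vector(t) for t in train["text"]])
X_test = np.vstack([doc_vector(t) for t in test["text"]])

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print("accuracy:", accuracy_score(test["label"], clf.predict(X_test)))
```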
HinFlair: pre-trained contextual string embeddings for pos tagging and text classification in the Hindi language
TLDR: HinFlair is introduced, a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus that outperforms previous state-of-the-art publicly available pre-trained embeddings on downstream tasks like text classification and POS tagging.
A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages
TLDR: A corpus of 600K word pairs mined from parallel translation corpora and monolingual corpora is created, the largest transliteration corpus for Indian languages mined from public sources, and an improved multilingual training recipe for Indic languages is proposed.
BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding
TLDR: This work builds a Bangla natural language understanding model pre-trained on 18.6 GB of data crawled from top Bangla sites on the internet, and identifies a major shortcoming of multilingual models that hurts performance for low-resource languages that don't share a writing script with any high-resource one, named the 'Embedding Barrier'.
Crosslingual Embeddings are Essential in UNMT for distant languages: An English to IndoAryan Case Study
TLDR: It is shown that initialising the embedding layer of UNMT models with cross-lingual embeddings yields significant improvements in BLEU score over existing approaches with randomly initialised embeddings.
NICT-5’s Submission To WAT 2021: MBART Pre-training And In-Domain Fine Tuning For Indic Languages
TLDR: It is observed that a small amount of pre-training followed by fine-tuning on small bilingual corpora can yield large gains over not using pre-training at all, and that multilingual fine-tuning leads to further gains in translation quality, significantly outperforming a very strong multilingual baseline that does not rely on any pre-training.
BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding
TLDR: The Embedding Barrier is introduced, a phenomenon that limits the monolingual performance of multilingual models on low-resource languages with unique typologies, and a straightforward solution of transcribing languages to a common script is proposed, which effectively improves the performance of a multilingual model for the Bangla language.
Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages
TLDR: This work compares the efficacy of fine-tuning the parameters of pre-trained models against training a language model from scratch, and empirically argues against a strict dependency between dataset size and model performance, instead encouraging task-specific model and method selection.
Sentiment Analysis Using XLM-R Transformer and Zero-shot Transfer Learning on Resource-poor Indian Language
TLDR: This research evaluates the performance of cross-lingual contextual word embeddings and zero-shot transfer learning in projecting predictions from resource-rich English to resource-poor Hindi, giving an effective solution for sentence-level (tweet-level) sentiment analysis in a resource-poor scenario.
A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models
TLDR: A review of Bangla NLP tasks, resources, and tools available to the research community is provided; benchmark datasets collected from various platforms are evaluated using current state-of-the-art algorithms (i.e., transformer-based models); and the results show promising performance with transformer-based models while highlighting the trade-off with computational cost.
Samsung R&D Institute Poland submission to WAT 2021 Indic Language Multilingual Task
TLDR: An effective approach to low-resource training is presented, consisting of pretraining on backtranslations and tuning on parallel corpora, together with two domain-adaptation techniques applied to monolingual corpora that significantly improved translation quality.

References

Showing 1-10 of 28 references
Polyglot: Distributed Word Representations for Multilingual NLP
TLDR: This work quantitatively demonstrates the utility of word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages, and investigates the semantic features captured through the proximity of word groupings.
Learning Word Vectors for 157 Languages
TLDR: This paper describes how high-quality word representations for 157 languages were trained on the free online encyclopedia Wikipedia and data from the Common Crawl project, and introduces three new word analogy datasets to evaluate these word vectors.
Deep Contextualized Word Representations
TLDR: A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.
Word Translation Without Parallel Data
TLDR: It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
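As an illustration of the space-alignment idea, the sketch below shows the orthogonal Procrustes step that maps one embedding space onto another given a set of seed pairs; the adversarial, fully unsupervised initialisation that the paper actually relies on is omitted, and the data here is synthetic.

```python
# Sketch of the orthogonal Procrustes step used to align two monolingual
# embedding spaces given n seed translation pairs; the fully unsupervised
# (adversarial) initialisation described in the paper is not shown.
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimising ||X W - Y||_F, for n x d matrices X, Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a hidden rotation from synthetic "dictionary" pairs.
rng = np.random.default_rng(0)
n, d = 200, 50
W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden rotation
X = rng.normal(size=(n, d))                        # "source-language" vectors
Y = X @ W_true                                     # "target-language" vectors

W = procrustes(X, Y)
print(np.allclose(X @ W, Y))                       # True: alignment recovered
```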
Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems
TLDR: This paper presents manually annotated monolingual word similarity datasets for six Indian languages: Urdu, Telugu, Marathi, Punjabi, Tamil and Gujarati, and reports baseline scores for state-of-the-art word representation models evaluated on the newly created datasets.
Enriching Word Vectors with Subword Information
TLDR: A new approach based on the skipgram model, where each word is represented as a bag of character n-grams and a word's vector is the sum of these representations, achieving state-of-the-art performance on word similarity and analogy tasks.
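The bag-of-character-n-grams idea can be sketched in a few lines: a word's vector is the sum of the vectors of its character n-grams, with boundary markers added around the word. The hashing scheme, n-gram range and random vectors below are illustrative stand-ins for the trained model's parameters.

```python
# Toy sketch of a subword word representation: a word vector is the sum of the
# vectors of its character n-grams, with "<" and ">" marking word boundaries.
# Bucket count, n-gram range and dimensionality are arbitrary, and the n-gram
# vectors are random stand-ins for learned parameters.
import numpy as np

DIM, BUCKETS = 50, 100_000
rng = np.random.default_rng(0)
ngram_vectors = rng.normal(scale=0.1, size=(BUCKETS, DIM))

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    idx = [hash(g) % BUCKETS for g in char_ngrams(word)]
    return ngram_vectors[idx].sum(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Morphologically related words share many character n-grams, so even these
# untrained vectors overlap more than vectors of unrelated words.
print(cosine(word_vector("recognize"), word_vector("recognized")))
print(cosine(word_vector("recognize"), word_vector("table")))
```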
Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages
TLDR: The focus of this paper is the infrastructure for automatically acquiring large amounts of monolingual text in many languages from various sources, and the largely language-independent framework for preprocessing, cleaning and creating the corpora and computing the necessary statistics.
N-gram Counts and Language Models from the Common Crawl
TLDR: This release improves upon the Google n-gram counts in two key ways, the inclusion of low-count entries and deduplication to reduce boilerplate; Kneser-Ney smoothing is then used to build large language models.
Cross-Lingual Sentiment Analysis for Indian Languages using Linked WordNets
TLDR: The crux of the idea is to use the linked WordNets of two languages to bridge the language gap, by using WordNet senses as features for supervised sentiment classification in Hindi and Marathi.
ACTSA: Annotated Corpus for Telugu Sentiment Analysis
TLDR: An effort to build a gold-standard annotated corpus of Telugu sentences to support Telugu sentiment analysis is described; the resulting corpus is the largest such resource currently available.