iNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
In this paper, we introduce NLP resources for 11 major Indian languages from two major language families. These resources include: (a) large-scale sentence-level monolingual corpora, (b) pre-trained word embeddings, (c) pre-trained language models, and (d) multiple NLU evaluation datasets (IndicGLUE benchmark). The monolingual corpora contain a total of 8.8 billion tokens across all 11 languages and Indian English, primarily sourced from news crawls. The word embeddings are based on FastText…

IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

In this paper, we present the IndicNLG suite, a collection of datasets for benchmarking Natural Language Generation (NLG) for 11 Indic languages. We focus on five diverse tasks, namely, biography

Comparative Analysis of Cross-lingual Contextualized Word Embeddings

This paper compares five multilingual and seven monolingual language models and investigates the effect of various aspects on their performance, such as vocabulary size, number of languages used for training and number of parameters.

MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages

Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance, whereas the performance degrades in the case of cross-language families.

Improving Low-Resource Languages in Pre-Trained Multilingual Language Models

This work proposes an unsupervised approach to improve the cross-lingual representations of low-resource languages by bootstrapping word translation pairs from monolingual corpora and using them to improve language alignment in pre-trained language models.

Am I a Resource-Poor Language? Data Sets, Embeddings, Models and Analysis for four different NLP Tasks in Telugu Language

This work shows that these representations significantly improve performance on four NLP tasks, presents benchmark results for Telugu, and argues that the pretrained embeddings are competitive with or better than the existing multilingual pretrained models: mBERT, XLM-R, and IndicBERT.

Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages

This work compares the efficacy of fine-tuning the parameters of pre-trained models against that of training a language model from scratch. It argues empirically against a strict dependency between dataset size and model performance, and instead encourages task-specific model and method selection.

BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding

This work introduces the Embedding Barrier, a phenomenon that limits the monolingual performance of multilingual models on low-resource languages with unique typologies. It proposes a straightforward solution, transcribing languages to a common script, which can effectively improve the performance of a multilingual model for the Bangla language.
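
Transcribing text to a common script can be illustrated with a minimal sketch. It assumes a simple code-point offset between the Bengali and Devanagari Unicode blocks, which are laid out in parallel; this is not necessarily the procedure used in the paper, and the function name `bengali_to_devanagari` is hypothetical.

```python
# Assumption: the Bengali (U+0980-U+09FF) and Devanagari (U+0900-U+097F)
# Unicode blocks are laid out in parallel, so most characters map by a
# fixed offset. This is an approximation, not a full transliterator.
BENGALI_START = 0x0980
BENGALI_END = 0x09FF
DEVANAGARI_START = 0x0900


def bengali_to_devanagari(text: str) -> str:
    """Map each Bengali code point to the Devanagari code point at the same block offset."""
    out = []
    for ch in text:
        cp = ord(ch)
        if BENGALI_START <= cp <= BENGALI_END:
            out.append(chr(cp - BENGALI_START + DEVANAGARI_START))
        else:
            # Leave spaces, punctuation, and other scripts unchanged.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(bengali_to_devanagari("\u09ac\u09be\u0982\u09b2\u09be"))
```

A few code points have no counterpart at the same offset in the other block, so a practical system would add an exception table on top of this mapping.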

Cross-lingual Few-Shot Learning on Unseen Languages

This paper uses a downstream sentiment analysis task across 12 languages, including 8 unseen languages, to analyze the effectiveness of several few-shot learning strategies across the three major types of model architectures and their learning dynamics and shows that taking the context from a mixture of random source languages is surprisingly more effective.

IndicBART: A Pre-trained Model for Indic Natural Language Generation

The authors' experiments show that a model specific to related languages like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller, and performs well on very low-resource translation scenarios where languages are not included in pre-training or fine-tuning.

Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

The largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families is presented and the utility of the obtained dataset on existing test-sets and the Naamapadam-test data for 8 Indic languages is demonstrated.



FlauBERT: Unsupervised Language Model Pre-training for French

This paper introduces and shares FlauBERT, a model learned on a very large and heterogeneous French corpus, applies it to diverse NLP tasks, and shows that most of the time it outperforms other pre-training approaches.

CLUE: A Chinese Language Understanding Evaluation Benchmark

The first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark is introduced, an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.

Unsupervised Cross-lingual Representation Learning at Scale

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and demonstrates for the first time the possibility of multilingual modeling without sacrificing per-language performance.

Word Translation Without Parallel Data

It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
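
Once two embedding spaces have been aligned, a dictionary can be induced by nearest-neighbor search. The sketch below shows only that retrieval step with cosine similarity. It assumes the vectors are already in a shared space and does not perform the unsupervised alignment itself; the toy vectors and the name `induce_dictionary` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def induce_dictionary(src_vecs, tgt_vecs):
    """Pair each source word with its most similar target word."""
    return {
        src_word: max(tgt_vecs, key=lambda t: cosine(src_vec, tgt_vecs[t]))
        for src_word, src_vec in src_vecs.items()
    }


# Toy, hand-made vectors assumed to be already aligned (illustration only).
SRC = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}
TGT = {"chat": [0.8, 0.2], "chien": [0.2, 0.8]}

if __name__ == "__main__":
    print(induce_dictionary(SRC, TGT))
```

Plain nearest-neighbor retrieval is the simplest choice; it is used here for brevity and is not claimed to be the retrieval criterion of the paper.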

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn NLP tasks such as question answering, machine translation, reading comprehension, and summarization without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

A Multilingual Parallel Corpora Collection Effort for Indian Languages

This paper reports on methods for constructing sentence-aligned parallel corpora using tools enabled by recent advances in machine translation and in cross-lingual retrieval based on deep neural networks.

How Multilingual is Multilingual BERT?

It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.

Polyglot: Distributed Word Representations for Multilingual NLP

This work quantitatively demonstrates the utility of word embeddings by using them as the sole features for training a part of speech tagger for a subset of these languages and investigates the semantic features captured through the proximity of word groupings.

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, and finds that it is possible to achieve comparable accuracy with direct subword training from raw sentences.

Learning Word Vectors for 157 Languages

This paper describes how high-quality word representations for 157 languages were trained on the free online encyclopedia Wikipedia and data from the Common Crawl project, and introduces three new word analogy datasets to evaluate these word vectors.