Corpus ID: 237355019

BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding

Authors: Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, Md Saiful Islam, M. Sohel Rahman, Anindya Iqbal and Rifat Shahriyar
In this paper, we introduce the “Embedding Barrier”, a phenomenon that limits the monolingual performance of multilingual models on low-resource languages with unique typologies. We build ‘BanglaBERT’, a Bangla language model pretrained on 18.6 GB of Internet-crawled data, and benchmark it on five standard NLU tasks. We discover a significant drop in the performance of the state-of-the-art multilingual model (XLM-R) compared to BanglaBERT and attribute this to the Embedding Barrier through comprehensive… 
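The Embedding Barrier described above stems from how shared multilingual vocabularies under-represent scripts like Bangla: with few Bangla subwords in the vocabulary, words fragment into many single-character pieces, each carrying little information. A minimal sketch of this effect, using toy vocabularies and a greedy longest-match segmenter (none of this is taken from the paper's actual tokenizers):

```python
# Illustrative sketch of the "Embedding Barrier": a vocabulary with
# sparse Bangla coverage fragments a word into many pieces, while a
# dedicated monolingual vocabulary keeps it intact. Both vocabularies
# below are toy examples, not drawn from any real model.

def greedy_tokenize(word, vocab):
    """Left-to-right longest-match subword segmentation."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest span first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character falls back to itself
            i += 1
    return pieces

# Toy "multilingual" vocab: only isolated Bangla characters made the cut.
multilingual_vocab = set("বাংলা")
# Toy "monolingual" vocab: whole-word and multi-character pieces as well.
monolingual_vocab = {"বাংলা", "বাং", "লা"} | set("বাংলা")

word = "বাংলা"  # "Bangla"
print(len(greedy_tokenize(word, multilingual_vocab)))  # → 5 (one piece per character)
print(len(greedy_tokenize(word, monolingual_vocab)))   # → 1 (whole word kept intact)
```

The character-level fallback mirrors what happens in real multilingual tokenizers when a script is poorly covered: longer, less informative token sequences for the same text, which the paper argues limits downstream monolingual performance.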
1 Citation
A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language Models
We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging and named entity recognition.

References

IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages
This paper introduces NLP resources for 11 major Indian languages from two major language families, and creates datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple choice QA, Winograd NLI and COPA.
Banner: A Cost-Sensitive Contextualized Model for Bangla Named Entity Recognition
This paper proposes multiple BERT-based deep learning models that use contextualized embeddings from BERT as inputs, along with a simple statistical approach to class-weight cost-sensitive learning.
The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
This work introduces the FLORES evaluation datasets for Nepali–English and Sinhala–English, based on sentences translated from Wikipedia, and demonstrates that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT.
Unsupervised Cross-lingual Representation Learning at Scale
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
Emerging Cross-lingual Structure in Pretrained Language Models
It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.
XNLI: Evaluating Cross-lingual Sentence Representations
This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 14 languages, including low-resource languages such as Swahili and Urdu and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.
On Romanization for Model Transfer Between Scripts in Neural Machine Translation
The results show that romanization entails information loss and is thus not always superior to simpler vocabulary transfer methods, but can improve the transfer between related languages with different scripts.
Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation
This work builds a customized sentence segmenter for Bengali and proposes two novel methods for parallel corpus creation in low-resource setups, aligner ensembling and batch filtering, which will pave the way for future research on Bengali–English machine translation as well as other low-resource languages.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Improving Multilingual Models with Language-Clustered Vocabularies
This work introduces a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies.