BERTifying Sinhala - A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification

Authors: Vinura Dhananjaya, Piyumal Demotte, Surangika Ranathunga, Sanath Jayasena
Published in: International Conference on Language Resources and Evaluation
This research provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We test on a set of different Sinhala text classification tasks and our analysis shows that out of the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is the best model by far for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which are far superior to the existing pre… 

Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World

Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, different facets of this problem, or the reasons behind this disparity, are seldom discussed.

Adapter-based fine-tuning of pre-trained multilingual language models for code-mixed and code-switched text classification

This paper explores adapter-based fine-tuning of pre-trained multilingual language models (PMLMs) for code-mixed and code-switched (CMCS) text classification and presents a newly annotated dataset for classifying Sinhala–English code-mixed and code-switched text, where Sinhala is a low-resource language.

SOLD: Sinhala Offensive Language Dataset

This paper introduces the Sinhala Offensive Language Dataset (SOLD), a manually annotated dataset of 10,000 Twitter posts labeled as offensive or not offensive at both the sentence level and the token level, improving the explainability of the ML models.



Sentiment Analysis of Sinhala News Comments

It is demonstrated that for low-resource languages such as Sinhala, the use of recently introduced word embedding models as semantic features can compensate for the lack of well-developed language-specific linguistic or language resources, and text classification with acceptable accuracy is indeed possible using both traditional statistical classifiers and Deep Learning models.

Sentiment Analysis for Sinhala Language using Deep Learning Techniques

This paper presents a comprehensive study on the use of standard sequence models such as RNN, LSTM, and Bi-LSTM, as well as more recent state-of-the-art models such as hierarchical attention hybrid neural networks and capsule networks.

Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers.

Context aware stopwords for Sinhala Text classification

Seven stopword identification methods previously applied to other languages are presented for removing stopwords, and a new algorithm for building a domain-specific stopword list is proposed.

Effectiveness of rule-based classifiers in Sinhala text categorization

This study is limited to rule-based classifiers because they are human-interpretable by nature, which is an added advantage in text classification.

iNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

This paper introduces NLP resources for 11 major Indian languages from two major language families, and creates datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple choice QA, Winograd NLI and COPA.

Sentiment Analysis of Sinhala News Comments using Sentence-State LSTM Networks

A novel state-of-the-art deep learning technique called sentence state long short-term memory network is presented for Sinhala sentiment classification, which outperforms both the previously reported statistical machine learning algorithms and deep learning algorithms.

AraBERT: Transformer-based Model for Arabic Language Understanding

This paper pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language, and showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks.

Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment

A weighting mechanism that makes use of available small-scale parallel corpora to improve the performance of multilingual sentence representations on document and sentence alignment is presented.

Cross-lingual Language Model Pretraining

This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that relies only on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.