DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Authors: Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, John P. McCrae
Journal: Language Resources and Evaluation
Pages: 765-806

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by…

Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling

This work classifies code-mixed social media comments and posts in the Dravidian languages Tamil, Kannada, and Malayalam, and aims to improve offensive language identification by generating pseudo-labels on the dataset.

An Ensemble Model for Sentiment Classification on Code-Mixed Data in Dravidian Languages

An ensemble sentiment classification strategy based on majority voting over 13 different classification models was applied to the Dravidian code-mixed languages dataset provided in FIRE 2021; the ensemble of multiple classifiers outperformed the others for sentiment classification.
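
As a rough illustration of majority voting (not the authors' implementation), the sketch below picks the label that most classifiers agree on. The function name, the tie-breaking rule, and the example votes are assumptions made for this illustration only.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties are broken in favour of the tied label that appears first in
    `predictions` (an assumption of this sketch, not part of the source).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of five classifiers for a single comment.
votes = ["Positive", "Negative", "Positive", "Mixed_feelings", "Positive"]
print(majority_vote(votes))
```

An empty prediction list raises a `ValueError` here; a real pipeline would need to decide how to handle that case.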

An ensemble-based model for sentiment analysis of Dravidian code-mixed social media posts

An ensemble-based model to classify Kannada-English, Malayalam-English, and Tamil-English social media postings into five different sentiment classes using character-level TF-IDF features as input is presented.

Sentiment Analysis on Dravidian Code-Mixed YouTube Comments using Paraphrase XLM-RoBERTa Model

This work uses the Paraphrase XLM-RoBERTa model to classify the sentiment of YouTube comments in code-mixed language, ranking first, second, and third on the Tamil, Malayalam, and Kannada code-mixed datasets, respectively.

Offensive Language Classification of Code-Mixed Tamil with Keras

This paper presents the method adopted for completing Task 1 of Dravidian-CodeMix-HASOC (Hate Speech and Offensive Content Identification in English and Indo-European Languages) Shared Task proposed

Transformer based Sentiment Analysis in Dravidian Languages

A soft voting classifier built on fine-tuned multilingual language models is proposed, achieving best weighted F1-scores of 0.752, 0.619, and 0.648 on Malayalam, Tamil, and Kannada, respectively.
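
Soft voting averages the class probabilities of several models and picks the class with the highest mean, instead of counting hard votes. The sketch below is a minimal illustration under that general definition; the function name, labels, and probability values are hypothetical and do not come from the paper.

```python
def soft_vote(prob_rows, labels):
    """Average per-model class probabilities and return the top label.

    prob_rows: one list of class probabilities per model, ordered like `labels`.
    Returns the winning label and the list of averaged probabilities.
    """
    n_models = len(prob_rows)
    avg = [sum(row[i] for row in prob_rows) / n_models for i in range(len(labels))]
    best_index = max(range(len(labels)), key=lambda i: avg[i])
    return labels[best_index], avg


# Hypothetical probabilities from three fine-tuned models for one comment.
labels = ["Positive", "Negative", "Neutral"]
model_probs = [
    [0.9, 0.1, 0.0],
    [0.4, 0.6, 0.0],
    [0.4, 0.5, 0.1],
]
label, avg = soft_vote(model_probs, labels)
print(label, avg)
```

In this made-up example two of the three models prefer "Negative", so hard majority voting would choose it, while soft voting chooses "Positive" because the first model is much more confident.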

Findings of Shared Task on Offensive Language Identification in Tamil and Malayalam

Those benchmark systems are analysed to find out how well they accommodate a code-mixed scenario in Dravidian languages, focusing on Tamil and Malayalam.

Benchmarking Multi-Task Learning for Sentiment Analysis and Offensive Language Identification in Under-Resourced Dravidian Languages

Analysis of fine-tuned models indicates the preference of multi-task learning over single-task learning, resulting in a higher weighted F1-score on all three languages, including Kannada, Malayalam and Tamil.
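
For readers unfamiliar with the metric, the weighted F1-score is the per-class F1 averaged with weights proportional to each class's share of the true labels. The sketch below follows that standard definition (the same idea as scikit-learn's `f1_score` with `average="weighted"`); the function name and the toy labels are assumptions for illustration.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true instances of this class
        score += f1 * support / total
    return score


# Toy example with hypothetical labels.
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow class frequency, frequent classes dominate the score, which matters for imbalanced sentiment and offensive-language datasets.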

Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam

Those benchmark systems are analysed to find out how well they accommodate a code-mixed scenario in Dravidian languages, focusing on Tamil and Malayalam.

TamilEmo: Finegrained Emotion Detection Dataset for Tamil

This labeled dataset (the largest manually annotated dataset of more than 42k Tamil YouTube comments, labeled for 31 emotions including neutral) for emotion recognition is introduced to improve emotion detection in multiple downstream tasks in Tamil.



Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text

The participants were given a dataset of YouTube comments and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying them into positive, negative, neutral, mixed-feeling classes or by recognising whether the comment is not in the intended language.

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

A new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators is presented, which obtained a Krippendorff’s alpha above 0.8 for the dataset.

Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada

A shared task on offensive language detection in Dravidian languages is created and an overview of the methods and the results of the competing systems are presented.

Sentiment Analysis for Code-Mixed Indian Social Media Text With Distributed Representation

This paper creates a Kannada-English code-mixed corpus by crawling Facebook comments and also uses the code-mixed sentiment analysis corpus provided by Sentiment Analysis for Indian Languages (SAIL)-2017, which includes Bengali-English and Hindi-English.

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

A gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube is created and inter-annotator agreement is presented, and the results of sentiment analysis trained on this corpus are shown.

Overview of the track on HASOC-Offensive Language Identification-DravidianCodeMix

The results and main findings of the HASOC-Offensive Language Identification on code mixed Dravidian languages and the system submission and methods used by participants are presented.

Code Mixing: A Challenge for Language Identification in the Language of Social Media

A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish.

Sentiment Analysis in Tamil Texts: A Study on Machine Learning Techniques and Feature Representation

Basic features such as word count and punctuation count are used alongside traditional features, including Bag of Words and Term Frequency-Inverse Document Frequency, to check their influence on the prediction.

Predicting the Type and Target of Offensive Posts in Social Media

The Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.