Corpus ID: 235458232

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, John P. McCrae
This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by…
Benchmarking Multi-Task Learning for Sentiment Analysis and Offensive Language Identification in Under-Resourced Dravidian Languages
Analysis of the fine-tuned models indicates that multi-task learning is preferable to single-task learning, yielding a higher weighted F1-score on all three languages: Kannada, Malayalam, and Tamil.
Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling
This work aims to classify code-mixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam, improving offensive language identification by generating pseudo-labels on the dataset.
PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for Offensive Language Identification in Tanglish
This task aims to identify offensive content in code-mixed comments/posts in Dravidian languages collected from social media. The described approach pools the last layers of the pretrained multilingual BERT transformer, which helped it achieve rank nine on the leaderboard.


Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text
The participants were given a dataset of YouTube comments, and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying it as positive, negative, neutral, or mixed-feeling, or by recognising that the comment is not in the intended language.
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
A new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators, is presented; the dataset obtained a Krippendorff’s alpha above 0.8.
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada
Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, the field of offensive language identification in under-resourced Tamil,…
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
A gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube is created and inter-annotator agreement is presented, and the results of sentiment analysis trained on this corpus are shown.
Sentiment Analysis for Code-Mixed Indian Social Media Text With Distributed Representation
This paper creates a Kannada-English code-mixed corpus by crawling Facebook comments and also uses the code-mixed sentiment analysis corpus provided by Sentiment Analysis for Indian Languages (SAIL)-2017, which covers Bengali-English and Hindi-English.
An Automatic Language Identification System for Code-Mixed English-Kannada Social Media Text
  • B. S. Sowmya Lakshmi, B. Shambhavi
  • Computer Science
  • 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS)
  • 2017
This work focuses on word-level language identification (LID) for code-mixed data consisting of English and Kannada code-mixed sentences from social media posts, and experiments with various supervised classifiers.
Sentiment Analysis in Tamil Texts: A Study on Machine Learning Techniques and Feature Representation
Basic features such as word count and punctuation count are used alongside traditional features, including Bag of Words and Term Frequency-Inverse Document Frequency, to check their influence on the prediction.
Code Mixing: A Challenge for Language Identification in the Language of Social Media
A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
Predicting the Type and Target of Offensive Posts in Social Media
The Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.
Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German
This paper presents the HASOC track and its two parts, creating test collections for languages with few resources and English for comparison, and presents the tasks, the data and the main results.