Corpus ID: 235458232

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

@article{Chakravarthi2021DravidianCodeMixSA,
  title={DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text},
  author={Bharathi Raja Chakravarthi and R. Priyadharshini and V. Muralidaran and Navya Jose and Shardul Suryawanshi and E. Sherly and John P. McCrae},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.09460}
}
This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by…
Benchmarking Multi-Task Learning for Sentiment Analysis and Offensive Language Identification in Under-Resourced Dravidian Languages
TLDR
Analysis of fine-tuned models indicates that multi-task learning is preferable to single-task learning, yielding a higher weighted F1-score on all three languages: Kannada, Malayalam and Tamil.
Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling
Social media has effectively become the prime hub of communication and digital marketing. As these platforms enable the free manifestation of thoughts and facts in text, images and video, there is an…

References

Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text
TLDR
The participants were given a dataset of YouTube comments, and the goal of the shared task submissions was to recognise the sentiment of each comment by classifying it into the positive, negative, neutral, or mixed-feeling classes, or by recognising that the comment is not in the intended language.
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
TLDR
A new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English, annotated by voluntary annotators, is presented; the dataset obtained a Krippendorff’s alpha above 0.8.
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada
Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, the field of offensive language identification in under-resourced Tamil,…
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
TLDR
A gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube is created and inter-annotator agreement is presented, and the results of sentiment analysis trained on this corpus are shown.
Sentiment Analysis for Code-Mixed Indian Social Media Text With Distributed Representation
TLDR
This paper created a Kannada-English code-mixed corpus by crawling Facebook comments and used the code-mixed sentiment analysis corpus provided by Sentiment Analysis for Indian Languages (SAIL)-2017, which includes Bengali-English and Hindi-English.
An Automatic Language Identification System for Code-Mixed English-Kannada Social Media Text
  • B. S. Sowmya Lakshmi, B. Shambhavi
  • Computer Science
  • 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS)
  • 2017
TLDR
This work focused on the problem of word-level language identification (LID) for code-mixed data consisting of English-Kannada code-mixed sentences from social media posts, and experimented with various supervised classifiers.
Sentiment Analysis in Tamil Texts: A Study on Machine Learning Techniques and Feature Representation
TLDR
Basic features such as word count and punctuation count are used in addition to traditional features, including Bag of Words and Term Frequency-Inverse Document Frequency, to check their influence on the prediction.
Code Mixing: A Challenge for Language Identification in the Language of Social Media
TLDR
A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
Predicting the Type and Target of Offensive Posts in Social Media
TLDR
The Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.
Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam, Hindi, English and German
TLDR
This paper presents the HASOC track and its two parts, creating test collections for languages with few resources and English for comparison, and presents the tasks, the data and the main results.