• Corpus ID: 227230540

KanCMD: Kannada CodeMixed Dataset for Sentiment Analysis and Offensive Language Detection

Adeep Hande, Ruba Priyadharshini, Bharathi Raja Chakravarthi
We introduce the Kannada CodeMixed Dataset (KanCMD), a multi-task learning dataset for sentiment analysis and offensive language identification. The KanCMD dataset highlights two real-world issues in social media text. First, it contains actual code-mixed comments posted by users on YouTube, rather than monolingual textbook text. Second, it has been annotated for two tasks, namely sentiment analysis and offensive language detection for under-resourced Kannada…

Benchmarking Multi-Task Learning for Sentiment Analysis and Offensive Language Identification in Under-Resourced Dravidian Languages
Analysis of fine-tuned models favors multi-task learning over single-task learning, which yields a higher weighted F1-score on all three languages: Kannada, Malayalam, and Tamil.
Offensive language identification in Dravidian code mixed social media text
The experimental results showed that 1- to 6-gram character TF-IDF features work best for this task, and that the best-performing models were naive Bayes, logistic regression, and a vanilla neural network for the Tamil code-mixed, Malayalam code-mixed, and Malayalam script-mixed datasets, respectively.
Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts
A novel and flexible approach based on selective translation and transliteration is proposed to get better results from fine-tuning and ensembling multilingual transformer networks such as BERT, DistilBERT, and XLM-RoBERTa; the results are promising for effective offensive speech identification in low-resourced languages.
IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages
The submission to the Dravidian-CodeMix shared task at FIRE 2021 is described; it employs pre-trained models such as ULMFiT and multilingual BERT fine-tuned on the code-mixed dataset, its transliteration (TRAI), English translations (TRAA) of the TRAI data, and the combination of all three.
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada
A shared task on offensive language detection in Dravidian languages is created, and an overview of the methods and results of the competing systems is presented.
Sentiment Classification of Code-Mixed Tweets using Bi-Directional RNN and Language Tags
This work takes up a similar challenge of developing a sentiment analysis model that can work with English-Tamil code-mixed data, using bi-directional LSTMs along with language tagging from a neural-network-based model.
Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling
This work aims to classify code-mixed social media comments and posts in the Dravidian languages Tamil, Kannada, and Malayalam, and to improve offensive language identification by generating pseudo-labels on the dataset.
HUB@DravidianLangTech-EACL2021: Identify and Classify Offensive Text in Multilingual Code Mixing in Social Media
This is the first task to detect offensive comments posted on social media in Dravidian languages; the multilingual BERT model is used to complete it.
DOSA: Dravidian Code-Mixed Offensive Span Identification Dataset
The Dravidian Offensive Span Identification Dataset (DOSA) is presented, which provides span annotations for Tamil-English and Kannada-English code-mixed comments posted by users on YouTube, an essential step towards semi-automated content moderation in Dravidian languages.
CUSATNLP@DravidianLangTech-EACL2021: Language Agnostic Classification of Offensive Content in Tweets
This shared task is to identify offensive content in the code-mixed Dravidian languages Kannada, Malayalam, and Tamil, using language-agnostic BERT (Bidirectional Encoder Representations from Transformers) for sentence embedding and a softmax classifier.


Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
A gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube is created and inter-annotator agreement is presented, and the results of sentiment analysis trained on this corpus are shown.
A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection
This work presents a Hindi-English code-mixed dataset of tweets posted on Twitter and proposes a supervised classification system for detecting hate speech in the text using various character-level, word-level, and lexicon-based features.
Sentiment Analysis for Code-Mixed Indian Social Media Text With Distributed Representation
This paper creates a Kannada-English code-mixed corpus by crawling Facebook comments and also uses the code-mixed sentiment analysis corpus provided by Sentiment Analysis for Indian Languages (SAIL)-2017, which includes Bengali-English and Hindi-English data.
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
A new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators is presented, which obtained a Krippendorff’s alpha above 0.8 for the dataset.
Detecting stance in Kannada social media code-mixed text using sentence embedding
For the first time, a stance detection system is presented for Kannada code-mixed comments extracted from the popular social media site Facebook, with an emphasis on trending local and national current issues in the Karnataka region.
Comparison of Pretrained Embeddings to Identify Hate Speech in Indian Code-Mixed Text
This paper compares pretrained models and creates an ensemble model for hate speech classification on Hindi-English code-mixed data, showing that XLNet performs better for hate speech detection in code-mixed text.
An Automatic Language Identification System for Code-Mixed English-Kannada Social Media Text
B. S. Sowmya Lakshmi, B. Shambhavi. Computer Science. 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), 2017.
This work focuses on word-level language identification (LID) for code-mixed data containing English and Kannada sentences from social media posts, and experiments with various supervised classifiers.
Offensive Language Identification in Greek
OGTD is a manually annotated dataset containing 4,779 posts from Twitter labeled as offensive or not offensive; several computational models are trained and tested on this data.
Predicting the Type and Target of Offensive Posts in Social Media
The Offensive Language Identification Dataset (OLID), a new dataset of tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.