Coarse and Fine-Grained Hostility Detection in Hindi Posts using Fine Tuned Multilingual Embeddings

@inproceedings{De2021CoarseAF,
  title={Coarse and Fine-Grained Hostility Detection in Hindi Posts using Fine Tuned Multilingual Embeddings},
  author={Arkadipta De and E Venkatesh and Kaushal Kumar Maurya and Maunendra Sankar Desarkar},
  booktitle={CONSTRAINT@AAAI},
  year={2021}
}
Due to the wide adoption of social media platforms like Facebook, Twitter, etc., there is an emerging need to detect online posts that can go against community acceptance standards. The hostility detection task has been well explored for resource-rich languages like English, but remains unexplored for resource-constrained languages like Hindi due to the unavailability of large, suitable datasets. We view hostility detection as a multi-label multi-class classification problem. We propose an…
Hostility Detection in Online Hindi-English Code-Mixed Conversations
TLDR
This paper proposes a novel hierarchical neural network architecture to identify hostile posts/comments/replies in online Hindi-English Code-Mixed conversations and leverages large multilingual pre-trained (mLPT) models like mBERT, XLMR, and MuRIL to do so.
Overview of CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts
TLDR
The findings of the shared tasks conducted at the CONSTRAINT Workshop at AAAI 2021 are presented and the most successful models were BERT or its variations.

References

Detecting Offensive Tweets in Hindi-English Code-Switched Language
TLDR
A novel tweet dataset, titled the Hindi-English Offensive Tweet (HEOT) dataset, is introduced; it consists of tweets in Hindi-English code-switched language split into three classes: non-offensive, abusive, and hate speech.
BanFakeNews: A Dataset for Detecting Fake News in Bangla
TLDR
An annotated dataset of ≈ 50K news articles is proposed that can be used for building automated fake news detection systems for a low-resource language like Bangla, and a benchmark system using state-of-the-art NLP techniques to identify Bangla fake news is developed.
Hostility Detection Dataset in Hindi
TLDR
A novel hostility detection dataset in the Hindi language is presented, comprising ~8200 manually annotated online posts that cover four hostility dimensions (fake news, hate speech, offensive, and defamation posts) along with a non-hostile label.
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
TLDR
A recent cross-lingual pre-trained model Unicoder is extended to cover both understanding and generation tasks, which is evaluated on XGLUE as a strong baseline and the base versions of Multilingual BERT, XLM and XLM-R are evaluated for comparison.
Unsupervised Cross-lingual Representation Learning at Scale
TLDR
It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization
TLDR
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
Overview of CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts
TLDR
The findings of the shared tasks conducted at the CONSTRAINT Workshop at AAAI 2021 are presented and the most successful models were BERT or its variations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection
TLDR
This work presents a Hindi-English code-mixed dataset consisting of tweets posted on Twitter and proposes a supervised classification system for detecting hate speech in the text using various character-level, word-level, and lexicon-based features.
Arabic Offensive Language Detection with Attention-based Deep Neural Networks
TLDR
This paper proposes methods for data preprocessing and balancing, and then presents the Convolutional Neural Network and bidirectional Gated Recurrent Unit models used, which are augmented with an attention layer.