Evaluation of Deep Learning Models for Hostility Detection in Hindi Text

Ramchandra Joshi, Rushabh Karnavat, Kaustubh Jirapure, Raviraj Joshi
2021 6th International Conference for Convergence in Technology (I2CT)
The social media platform is a convenient medium to express personal thoughts and share useful information. It is fast, concise, and has the ability to reach millions. It is an effective place to archive thoughts, share artistic content, receive feedback, promote products, etc. Despite having numerous advantages, these platforms have given a boost to hostile posts. Hate speech and derogatory remarks are being posted for personal satisfaction or political gain. The hostile posts can have a…

Text-Based Cyberbullying Prevention using Toxicity Filtering Mobile Chat Application and API

An attempt has been made to curb text-based cyberbullying on these social media platforms by providing an API (Application Programming Interface) that receives an input text and responds with an annotation indicating whether the text is predicted to be offensive.

Unicode-based Data Processing for Text Classification

A Unicode-based text data processing approach for machine learning classification is proposed; it simplifies text preprocessing while maintaining high accuracy.
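
As a rough illustration of what Unicode-based preprocessing can look like for Hindi text, the sketch below keeps only characters from the Devanagari block (U+0900 to U+097F) plus whitespace. The helper name `keep_devanagari` and the choice of filtering by this block are assumptions made for illustration; the summary above does not describe the paper's actual procedure.

```python
import re

# Assumption: treat the Devanagari block (U+0900-U+097F) as the only script to keep.
# Every run of other non-whitespace characters is replaced by a single space.
_NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F\s]+")
_WHITESPACE = re.compile(r"\s+")


def keep_devanagari(text: str) -> str:
    """Hypothetical helper: drop non-Devanagari characters and collapse whitespace."""
    cleaned = _NON_DEVANAGARI.sub(" ", text)
    return _WHITESPACE.sub(" ", cleaned).strip()
```

A filter like this needs no language-specific tokenizer or stop-word list, which is one way a Unicode-range approach could keep preprocessing simple.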

Hostility Detection in Online Hindi-English Code-Mixed Conversations

This paper proposes a novel hierarchical neural network architecture to identify hostile posts/comments/replies in online Hindi-English Code-Mixed conversations and leverages large multilingual pre-trained (mLPT) models like mBERT, XLMR, and MuRIL to do so.

Marathi Social Media Opinion Mining using XLM-R

This project proposes the use of XLM-RoBERTa (XLM-R) models for opinion mining of Marathi social media texts without using any translations.

Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi

It is shown that Marathi monolingual models outperform the multilingual BERT variants in five different downstream fine-tuning experiments and that monolingual MahaBERT-based models provide rich representations as compared to sentence embeddings from multilingual counterparts.

L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models

This work presents L3Cube-MahaHate, the first major Hate Speech Dataset in Marathi, and explores monolingual and multilingual variants of BERT like MahaBERT, IndicBERT, mBERT, and XLM-RoBERTa and shows that monolingual models perform better than their multilingual counterparts.

L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources

L3Cube-MahaCorpus, a Marathi monolingual data set scraped from different internet sources, is presented, and MahaGPT, a generative Marathi GPT model trained on a Marathi corpus, is released.

Hierarchical Neural Network Approaches for Long Document Classification

This work employs pre-trained Universal Sentence Encoder and Bidirectional Encoder Representations from Transformers in a hierarchical setup to capture better representations efficiently and provides a comparison of different deep learning algorithms like USE, BERT, HAN, Longformer, and BigBird for long document classification.

Machine Learning Models for Hate Speech and Offensive Language Identification for Indo-Aryan Language: Hindi

An overview is presented of efforts to automatically detect abusive language on Twitter in English and Indo-Aryan languages like Hindi, using several machine learning models for the Hindi subtasks.

Hate and Offensive Speech Detection in Hindi and Marathi

This work considers hate and offensive speech detection in Hindi and Marathi texts using deep learning architectures like CNN, LSTM, and variations of BERT like multilingual BERT, IndicBERT, and monolingual RoBERTa and shows that the transformer-based models perform the best and even the basic models along with FastText embeddings give a competitive performance.



AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

The IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families, is presented and it is shown that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks.

Bag of Tricks for Efficient Text Classification

A simple and efficient baseline for text classification is explored that shows that the fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.

Hostility Detection Dataset in Hindi

A novel hostility detection dataset in the Hindi language is presented, comprising ~8200 manually annotated online posts that cover four hostility dimensions: fake news, hate speech, offensive, and defamation posts, along with a non-hostile label.
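
The label scheme described above (four hostility dimensions plus a non-hostile label) is commonly handled as a multi-label problem. The sketch below shows one possible encoding as a multi-hot vector. The label strings, their order, the helper name `encode_labels`, and the rule that non-hostile excludes the other labels are all assumptions for illustration, not details taken from the dataset's documentation.

```python
from typing import Iterable, List

# Assumption: short label names and this particular ordering are illustrative only.
HOSTILE_DIMENSIONS = ["fake", "hate", "offensive", "defamation"]
NON_HOSTILE = "non-hostile"


def encode_labels(labels: Iterable[str]) -> List[int]:
    """Hypothetical helper: map a post's labels to a 4-dimensional multi-hot vector.

    A non-hostile post maps to all zeros.
    """
    label_set = {label.strip().lower() for label in labels}
    # Assumption: a post cannot be both non-hostile and hostile.
    if NON_HOSTILE in label_set and len(label_set) > 1:
        raise ValueError("non-hostile cannot be combined with hostile labels")
    unknown = label_set - set(HOSTILE_DIMENSIONS) - {NON_HOSTILE}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [int(dim in label_set) for dim in HOSTILE_DIMENSIONS]
```

With this encoding, a coarse hostile versus non-hostile decision can be derived by checking whether any entry of the vector is 1.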

Overview of CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts

The findings of the shared tasks conducted at the CONSTRAINT Workshop at AAAI 2021 are presented and the most successful models were BERT or its variations.

DeepHate: Hate Speech Detection via Multi-Faceted Text Representations

DeepHate is a novel deep learning model that combines multi-faceted text representations such as word embeddings, sentiments, and topical information, to detect hate speech in online social platforms and outperforms the state-of-the-art baselines on the hate speech detection task.

Deep Learning for Hindi Text Classification: A Comparison

Translated versions of English datasets are used to evaluate models based on CNN, LSTM, and Attention for Hindi text classification; the work also serves as a tutorial for popular text classification techniques.

Hate speech detection: Challenges and solutions

This work identifies and examines challenges faced by online automatic approaches for hate speech detection in text, and proposes a multi-view SVM approach that achieves near state-of-the-art performance, while being simpler and producing more easily interpretable decisions than neural methods.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, is introduced. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection

This work presents a Hindi-English code-mixed dataset consisting of tweets posted online on Twitter and proposes a supervised classification system for detecting hate speech in the text using various character level, word level, and lexicon based features.
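
To make "character level features" concrete, the sketch below counts overlapping character n-grams in a string. The helper name `char_ngrams` and the default `n = 3` are assumptions for illustration; the summary above does not state which n-gram sizes or feature weighting the authors used.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Hypothetical helper: count overlapping character n-grams in a string."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Strings shorter than n produce an empty range and therefore an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Character n-grams are often a reasonable choice for code-mixed text because they tolerate the inconsistent romanized spellings that word-level features would treat as unrelated tokens.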