Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection

Omkar Gokhale, Aditya Kane, Shantanu Patankar, Tanmay Chavan, Raviraj Joshi
Pre-training large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Although this method has proven to be effective for many domains, it might not always provide desirable benefits. In this paper, we study the effects of hateful pre-training on low-resource hate speech classification tasks. While previous studies on the English language have emphasized its importance, we aim to augment their observations with some non-obvious… 


A Twitter BERT Approach for Offensive Language Detection in Marathi

MahaTweetBERT, a BERT model pre-trained on Marathi tweets, when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), outperforms all other models with an F1 score of 98.43 on the HASOC 2022 test set.

L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

It is shown that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi, and the vanilla BERT models trained using this simple strategy outperform the multilingual LaBSE trained using a complex training strategy.

Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi

The widespread presence of offensive content online has become a cause for great concern in recent years, motivating researchers to develop robust systems capable of identifying such content automatically.

A Review of Challenges in Machine Learning based Automated Hate Speech Detection

This work explores a wide range of challenges in automatic hate speech detection in depth, presenting a hierarchical organization of these problems and focusing on the challenges faced by machine learning and deep learning based solutions to hate speech identification.

Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi

It is shown that Marathi monolingual models outperform the multilingual BERT variants in five different downstream fine-tuning experiments, and that monolingual MahaBERT-based models provide rich representations compared to sentence embeddings from their multilingual counterparts.

L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models

This work presents L3Cube-MahaHate, the first major hate speech dataset in Marathi, explores monolingual and multilingual variants of BERT such as MahaBERT, IndicBERT, mBERT, and XLM-RoBERTa, and shows that monolingual models perform better than their multilingual counterparts.

Overview of the HASOC Subtrack at FIRE 2022: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages

In recent years, the spread of online offensive content has become of great concern, motivating researchers to develop robust systems capable of identifying such content automatically, and the HASOC (Hate Speech and Offensive Content Identification) shared task is one of these initiatives.

HateBERT: Retraining BERT for Abusive Language Detection in English

HateBERT, a re-trained BERT model for abusive language detection in English, is introduced, and a battery of experiments comparing the portability of the fine-tuned models across datasets is discussed, suggesting that portability is affected by the compatibility of the annotated phenomena.

Advances in Machine Learning Algorithms for Hate Speech Detection in Social Media: A Review

The basic baseline components of hate speech classification using ML algorithms are examined, and the different variants of ML techniques are reviewed, including classical ML, ensemble approaches, and deep learning methods.

SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

This work creates the largest available dataset for this task, SOLID, which contains over nine million English tweets labeled in a semi-supervised manner, and demonstrates experimentally that using SOLID along with OLID yields improved performance on the OLID test set for two different models, especially for the lower levels of the taxonomy.

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

It is consistently found that multi-phase adaptive pretraining offers large gains in task performance, and it is shown that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.
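The "simple data selection strategies" mentioned above can be approximated in a few lines. The sketch below is illustrative only (not the paper's exact method, and the function names are hypothetical): it ranks candidate documents for task-adaptive pretraining by the Jaccard overlap between each document's vocabulary and the task corpus vocabulary, keeping the top-k.

```python
# Minimal sketch of vocabulary-overlap data selection for
# task-adaptive pretraining. Stdlib only; illustrative, not the
# paper's exact procedure.

def vocab(texts):
    """Lowercased word vocabulary of a list of strings."""
    return {w for t in texts for w in t.lower().split()}

def select_for_tapt(candidates, task_corpus, k=2):
    """Return the k candidate documents whose vocabulary overlaps most
    (by Jaccard similarity) with the task corpus vocabulary."""
    task_vocab = vocab(task_corpus)

    def score(doc):
        doc_vocab = vocab([doc])
        union = doc_vocab | task_vocab
        return len(doc_vocab & task_vocab) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:k]

task = ["this tweet is hateful", "offensive tweet detected"]
pool = [
    "a hateful offensive tweet",   # high overlap with the task corpus
    "recipe for lemon cake",       # no overlap
    "tweet about detected spam",   # partial overlap
]
selected = select_for_tapt(pool, task, k=2)
```

Real pipelines typically use stronger selectors (e.g. language-model perplexity or embedding similarity), but vocabulary overlap captures the core idea of augmenting a small task corpus with nearby unlabeled text.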

Deep Learning for Hate Speech Detection in Tweets

Experiments on a benchmark dataset of 16K annotated tweets show that such deep learning methods outperform state-of-the-art char/word n-gram methods by ~18 F1 points.
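Several results above are reported as F1 scores. As a reminder of what that metric measures, here is a minimal stdlib-only sketch of binary F1 (harmonic mean of precision and recall) for hate/not-hate labels; the function name and toy labels are illustrative, not taken from any of the papers.

```python
# Minimal sketch of the binary F1 metric used to compare the systems
# above. Stdlib only; labels are 1 (hateful) and 0 (not hateful).

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A toy classifier that misses one hateful tweet and raises one false alarm:
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

With two true positives, one false positive, and one false negative, precision and recall are both 2/3, so the F1 score is 2/3 as well; a gap of "~18 F1 points" means roughly 0.18 on this 0-1 scale (or 18 on the 0-100 scale used in some papers).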

L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library

The L3Cube-MahaNLP aims to build resources and a library for Marathi natural language processing and presents datasets and models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection.