• Corpus ID: 236493331

Detecting Abusive Albanian

Erida Nurce, Jorgel Keci, Leon Derczynski
The ever-growing use of social media in recent years has directly increased the presence of hate speech and offensive speech on online platforms. Research on effective detection of such content has mainly focused on English and a few other widespread languages, while the remaining majority have not received the same attention and thus cannot benefit from the steady advances made in the field. In this paper we present Shaj, an annotated Albanian dataset for hate…


Hate Speech Classification in Bulgarian

This work aggregates a real-world dataset from Bulgarian online forums, manually annotates 108,142 sentences, and develops and evaluates various classifiers on the dataset; the best model is a support vector machine with a linear kernel trained on character-level TF-IDF features.



Offensive Language and Hate Speech Detection for Danish

This work constructs DKhate, a Danish dataset of user-generated comments from various social media platforms that is, to the authors' knowledge, the first of its kind. The dataset is annotated for various types and targets of offensive language, and the work develops four automatic classification systems designed to work for both English and Danish.

Hate Speech Dataset from a White Supremacy Forum

A custom annotation tool has been developed to carry out the manual labelling task which, among other things, allows the annotators to choose whether to read the context of a sentence before labelling it.

A Corpus of Turkish Offensive Language on Social Media

Annotation guidelines are based on a careful review of the annotation practices of recent efforts for other languages, and results of automatically classifying the corpus using state-of-the-art text classification methods are presented.

Automated Hate Speech Detection and the Problem of Offensive Language

This work uses a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and labels a sample of these tweets into three categories: those containing hate speech, those containing only offensive language, and those with neither.

Social Network Hate Speech Detection for Amharic Language

An Apache Spark-based model is developed to classify Amharic Facebook posts and comments as hate or not hate; it achieves promising results, making use of Spark's features for big data.

Detecting Hate Speech on the World Wide Web

The definition of hate speech, the collection and annotation of the hate speech corpus, and a mechanism for detecting some commonly used methods of evading common "dirty word" filters are described.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

The results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval), based on a new dataset containing over 14,000 English tweets, are presented.

Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter

A list of criteria founded in critical race theory is provided and used to annotate a publicly available corpus of more than 16k tweets, and a dictionary based on the most indicative words in the data is presented.

Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter

It is found that amateur annotators are more likely than expert annotators to label items as hate speech, and that systems trained on expert annotations outperform systems trained on amateur annotations.

Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter at SemEval-2019 Task 5: Frequency Analysis Interpolation for Hate in Speech Detection

This document describes an approach based on changing the text representation for the task of Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter, as part of SemEval-2019. The task is