• Corpus ID: 233365150

To Block or not to Block: Experiments with Machine Learning for News Comment Moderation

Damir Korenčić, I. Baris Schlicht, Eugenia Fernandez, Katarina Leuschel, Eva Salido
Today, news media organizations regularly engage with readers by enabling them to comment on news articles. This creates the need for comment moderation and removal of disallowed comments – a time-consuming task often performed by human moderators. In this paper we approach the problem of automatic news comment moderation as classification of comments into blocked and not blocked categories. We construct a novel dataset of annotated English comments, experiment with cross-lingual transfer of… 


(Notebook for PAN at CLEF 2021)
A unified user profiling framework is presented that identifies hate speech spreaders by processing their tweets regardless of language and applies an attention mechanism to select important tweets for learning user profiles.
UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims
Language identification is proposed as an auxiliary task to mitigate unintended bias; training it jointly with check-worthy claim detection can provide performance gains for some of the selected languages.
AI-UPV at IberLEF-2021 DETOXIS task: Toxicity Detection in Immigration-Related Web News Comments Using Transformers and Statistical Models
The main objective was to implement an accurate model to detect xenophobia in comments about web news articles within the DETOXIS shared task 2021, based on the competition’s official metrics.


Delete or not Delete? Semi-Automatic Comment Moderation for the Newsroom
A semi-automatic, holistic approach to comment moderation is proposed, which includes comment features but also their context, such as information about users and articles, for evaluation.
Automating News Comment Moderation with Limited Resources: Benchmarking in Croatian and Estonian
Initial work on the automatic classification of user-generated content in news media, intended to support human moderators in two less-resourced European languages, Croatian and Estonian, is described.
Deep Learning for User Comment Moderation
Experimenting with a new dataset of 1.6M user comments from a Greek news portal and existing datasets of English Wikipedia comments, we show that an RNN outperforms the previous state of the art in
Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus
This work presents a dataset and annotation scheme for the new task of identifying “good” conversations that occur online, which it calls ERICs: Engaging, Respectful, and/or Informative Conversations. The dataset is one of the largest annotated corpora of online human dialogues, with the most detailed set of annotations.
The SFU Opinion and Comments Corpus: A Corpus for the Analysis of Online News Comments
The SFU Opinion and Comments Corpus, a collection of opinion articles and the comments posted in response to the articles, is presented and a subset of the large corpus (1043 comments) is annotated with four layers of annotations: constructiveness, toxicity, negation and Appraisal.
How Multilingual is Multilingual BERT?
It is concluded that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs, and that the model can find translation pairs.
Ex Machina: Personal Attacks Seen at Scale
A method that combines crowdsourcing and machine learning to analyze personal attacks at scale is developed and illustrated, and an evaluation method for a classifier in terms of the aggregated number of crowd-workers it can approximate is shown.
FinEst BERT and CroSloEngual BERT: less is more in multilingual models
The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations, including NER, POS-tagging, and dependency parsing.
Trawling for Trolling: A Dataset
It is found that these models are sensitive to data ablation, which suggests that the dataset is largely devoid of spurious statistical artefacts that could otherwise distract and confuse classification models.
Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages
The HASOC track intends to stimulate development in Hate Speech for Hindi, German and English by identifying Hate Speech in Social Media using LSTM networks processing word embedding input.