XD at SemEval-2020 Task 12: Ensemble Approach to Offensive Language Identification in Social Media Using Transformer Encoders

@article{Dong2020XDAS,
  title={XD at SemEval-2020 Task 12: Ensemble Approach to Offensive Language Identification in Social Media Using Transformer Encoders},
  author={Xiangjue Dong and Jinho D. Choi},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.10945}
}
This paper presents six document classification models using the latest transformer encoders and a high-performing ensemble model for a task of offensive language identification in social media. For the individual models, deep transformer layers are applied to perform multi-head attentions. For the ensemble model, the utterance representations taken from those individual models are concatenated and fed into a linear decoder to make the final decisions. Our ensemble model outperforms the… 
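The ensemble step described in the abstract (concatenate the utterance representations from the individual models, then apply a linear decoder) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two toy 2-dimensional vectors (the paper uses six models), the weights, and the `NOT`/`OFF` label names are all hypothetical, and a real system would learn the decoder weights during training.

```python
# Hypothetical sketch of "concatenate, then linear decoder".
# All names, dimensions, and numbers below are illustrative assumptions,
# not values taken from the paper.
from typing import List, Sequence


def concatenate(representations: Sequence[Sequence[float]]) -> List[float]:
    """Join the per-model utterance vectors into one feature vector."""
    joined: List[float] = []
    for vec in representations:
        joined.extend(vec)
    return joined


def linear_decoder(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Return one logit per class: logits[c] = weights[c] . features + bias[c]."""
    if any(len(row) != len(features) for row in weights):
        raise ValueError("each weight row must match the feature length")
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def predict(
    representations: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    labels: Sequence[str],
) -> str:
    """Pick the label whose logit is largest."""
    logits = linear_decoder(concatenate(representations), weights, bias)
    best = max(range(len(logits)), key=logits.__getitem__)
    return labels[best]


# Toy inputs: two "models", each emitting a 2-dimensional representation.
toy_representations = [[0.5, -1.0], [2.0, 0.0]]
toy_weights = [
    [1.0, 0.0, 0.0, 0.0],  # row for the first (assumed) class
    [0.0, 0.0, 1.0, 0.0],  # row for the second (assumed) class
]
toy_bias = [0.0, 0.0]
toy_labels = ["NOT", "OFF"]  # assumed label names, for illustration only

prediction = predict(toy_representations, toy_weights, toy_bias, toy_labels)
```

With these toy numbers the concatenated vector has four entries and the second class receives the larger logit, so `prediction` is `"OFF"`. In practice the representations would come from the fine-tuned transformer encoders and the decoder would be trained jointly or on top of them; the abstract does not specify which.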
1 Citation

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

TLDR
The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish.

References

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

TLDR
The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish.

Challenges in discriminating profanity from hate speech

TLDR
Analysis of the results reveals that discriminating hate speech and profanity is not a simple task, which may require features that capture a deeper understanding of the text not always possible with surface n-grams.

Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language

TLDR
This pilot edition of the GermEval Shared Task on the Identification of Offensive Language deals with the classification of German tweets from Twitter and describes the process of extracting the raw-data for the data collection and the annotation schema.

Predicting the Type and Target of Offensive Posts in Social Media

TLDR
The Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, is compiled and made publicly available.

Using Convolutional Neural Networks to Classify Hate-Speech

TLDR
A deep learning-based Twitter hate-speech text classification system that assigns each tweet to one of four predefined categories: racism, sexism, both (racism and sexism) and non-hate-speech.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Automated Hate Speech Detection and the Problem of Offensive Language

TLDR
This work used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and labeled a sample of these tweets into three categories: those containing hate speech, those containing only offensive language, and those with neither.

A Simple Method for Commonsense Reasoning

TLDR
Key to this method is the use of language models, trained on a massive amount of unlabeled data, to score multiple-choice questions posed by commonsense reasoning tests, which outperform previous state-of-the-art methods by a large margin.