• Corpus ID: 245650751

Toxicity Detection for Indic Multilingual Social Media Content

  title={Toxicity Detection for Indic Multilingual Social Media Content},
  author={Manan A. Jhaveri and Devanshu Ramaiya and Harveen Singh Chadha},
Toxic content is one of the most critical issues for social media platforms today. India alone had 518 million social media users in 2020. In order to provide a good experience to content creators and their audience, it is crucial to flag toxic comments and the users who post that. But the big challenge is identifying toxicity in low resource Indic languages because of the presence of multiple representations of the same text. Moreover, the posts/comments on social media do not adhere to a… 

Figures and Tables from this paper

Multilingual and Multimodal Abuse Detection

The proposed method, MADA, explicitly focuses on two modalities other than the audio itself, namely, the underlying emotions expressed in the abusive audio and the semantic information encapsulated in the corresponding text.



iNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

This paper introduces NLP resources for 11 major Indian languages from two major language families, and creates datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple choice QA, Winograd NLI and COPA.

MuRIL: Multilingual Representations for Indian Languages

MuRIL is proposed, a multilingual LM specifically built for Indian (IN) languages that is trained on significantly large amounts of IN text corpora only and explicitly augment monolingualText corpora with both transliterated data and training data.

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

This paper pre-train MLMs on sentences with randomly shuffled word order, and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks—including tasks specifically designed to be challenging for models that ignore word order.

Cross-Lingual Ability of Multilingual BERT: An Empirical Study

A comprehensive study of the contribution of different components in M-BERT to its cross-lingual ability, finding that the lexical overlap between languages plays a negligible role, while the depth of the network is an integral part of it.

Rethinking embedding coupling in pre-trained language models

The analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages.

MASS: Masked Sequence to Sequence Pre-training for Language Generation

This work proposes MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks, which achieves the state-of-the-art accuracy on the unsupervised English-French translation, even beating the early attention-based supervised model.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Experiment tracking with weights and biases, 2020. Software available from wandb

    Crosslingual ability of multilingual bert: An empirical study, 2020

    • 2020