Corpus ID: 235435725

Challenges and Considerations with Code-Mixed NLP for Multilingual Societies

Vivek Srivastava and Mayank Kumar Singh
Multilingualism refers to a high degree of proficiency in two or more languages in written and oral communication. It often results in language mixing, a.k.a. code-mixing, when a multilingual speaker switches between multiple languages within a single utterance of text or speech. This paper discusses the current state of NLP research, its limitations, and foreseeable pitfalls in addressing five real-world applications for social good: crisis management, healthcare, political…
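The degree of mixing in an utterance like the ones the abstract describes is commonly quantified with the Code-Mixing Index (CMI) of Das and Gambäck, which the code-mixing literature uses widely though the abstract itself does not define it. A minimal sketch, assuming each token already carries a language tag and using 'univ' as a hypothetical tag for language-independent tokens:

```python
from collections import Counter

def code_mixing_index(tags):
    """Code-Mixing Index (CMI) of one utterance, given per-token
    language tags. 'univ' marks language-independent tokens (named
    entities, punctuation, numbers). Returns 0 for a monolingual
    utterance and approaches 100 as mixing becomes more even."""
    n = len(tags)
    lang_tags = [t for t in tags if t != "univ"]
    u = n - len(lang_tags)              # language-independent tokens
    if n == u:                          # nothing language-tagged
        return 0.0
    max_w = max(Counter(lang_tags).values())  # dominant-language count
    return 100.0 * (1.0 - max_w / (n - u))

# Illustrative Hinglish tag sequence: 2 Hindi, 3 English, 1 universal.
tags = ["hi", "en", "en", "hi", "en", "univ"]
print(round(code_mixing_index(tags), 1))
```

With three of five language-tagged tokens in the dominant language, the example scores 100 × (1 − 3/5) = 40, i.e. moderately mixed.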


Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages

Prabhupadavani is a multilingual code-mixed speech translation (ST) dataset for 25 languages. It is multi-domain, covers ten language families, and contains 94 hours of speech by 130+ speakers, manually aligned with the corresponding text in the target language.

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

The first large-scale real Hindi-English code-mixed dataset in Roman script is presented. L3Cube-HingCorpus, the largest code-mixed Hindi-English language identification (LID) dataset, and HingBERT-LID, a production-quality LID model, are released to facilitate capturing more code-mixed data using the process outlined in this work.



GLUECoS: An Evaluation Benchmark for Code-Switched NLP

This work presents GLUECoS, an evaluation benchmark for code-switched languages that spans several NLP tasks in English-Hindi and English-Spanish. It shows that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, indicating that multilingual models can be further optimized for code-switching tasks.

A Semi-supervised Approach to Generate the Code-Mixed Text using Pre-trained Encoder and Transfer Learning

This work proposes an effective deep learning approach for automatically generating code-mixed text from English into multiple languages without any parallel data, and transfers knowledge from a neural machine translation model to warm-start the training of the code-mixed generator.

LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation

A centralized benchmark for Linguistic Code-switching Evaluation (LinCE) that combines eleven corpora covering four different code-switched language pairs and four tasks and provides the scores of different popular models, including LSTM, ELMo, and multilingual BERT so that the NLP community can compare against state-of-the-art systems.

Hindi-English Hate Speech Detection: Author Profiling, Debiasing, and Practical Perspectives

A three-tier pipeline that employs profanity modeling, deep graph embeddings, and author profiling to retrieve instances of hate speech in Hindi-English code-switched language (Hinglish) on social media platforms like Twitter is introduced.

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data

A computational technique for creating grammatically valid artificial CM data based on the Equivalence Constraint Theory is presented. It is shown that when training examples are sampled appropriately from this synthetic data and presented in a certain order, they can significantly reduce the perplexity of an RNN-based language model.
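The Equivalence Constraint broadly permits a switch only at points where both languages share the same surface word order. A toy sketch of that idea, not the paper's actual system: it assumes a word-aligned parallel sentence whose alignment is monotone (an idealization under which every token boundary is an equivalence point) and enumerates single-switch mixes.

```python
def switch_point_mixes(pairs):
    """Given a word-aligned, monotonically ordered parallel sentence
    (a list of (lang1_word, lang2_word) pairs), emit every code-mixed
    variant produced by one switch from language 1 to language 2.
    The monotone alignment assumed here makes every boundary a legal
    switch point under the Equivalence Constraint; real parallel
    sentences need an alignment step and a constraint check first."""
    mixes = []
    for k in range(1, len(pairs)):  # k = index of the switch point
        mixed = [w1 for w1, _ in pairs[:k]] + [w2 for _, w2 in pairs[k:]]
        mixes.append(" ".join(mixed))
    return mixes

# Hypothetical English-Spanish alignment that happens to be monotone.
pairs = [("the", "la"), ("house", "casa"), ("is", "es"), ("big", "grande")]
for sentence in switch_point_mixes(pairs):
    print(sentence)
```

This prints "the casa es grande", "the house es grande", and "the house is grande"; the paper's method additionally samples and orders such synthetic examples before language-model training.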

Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques

This work systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages (Hinglish, Tenglish, and Tamlish) belonging to two language families (Indo-Aryan and Dravidian) that are prevalent in bi- and multilingual communities.

POS Tagging of English-Hindi Code-Mixed Social Media Content

The initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums are described, and language identification, back-transliteration, normalization, and POS tagging of this data are explored.
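The first stage of the pipeline just described, word-level language identification, can be illustrated with a minimal heuristic sketch. The tiny lexicon of romanized Hindi function words below is hypothetical and stands in for what real systems learn from annotated data; the 'en' fallback is a deliberate simplification.

```python
# Hypothetical lexicon of romanized Hindi function words (toy only).
HINDI_LEXICON = {"hai", "nahi", "kya", "aur", "main", "tum", "ka", "ki"}

def tag_languages(tokens):
    """Label each token 'hi', 'en', or 'univ' (numbers/punctuation).
    A real LID stage would use a trained classifier; unknown words
    here simply fall back to 'en'."""
    tags = []
    for tok in tokens:
        low = tok.lower()
        if not low.isalpha():
            tags.append("univ")         # language-independent token
        elif low in HINDI_LEXICON:
            tags.append("hi")
        else:
            tags.append("en")           # fallback assumption
    return tags

print(tag_languages("this movie kya mast hai !".split()))
```

These tags then feed the later stages of the pipeline: Hindi tokens go through back-transliteration to Devanagari and normalization before POS tagging.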

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks…

IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection

This work presents a candidate sentence generation and selection approach on top of a Bi-LSTM based neural classifier to classify Hinglish code-mixed text into one of three sentiment classes: positive, negative, or neutral.

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced: a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.