Word Level Language Identification in Assamese-Bengali-Hindi-English Code-Mixed Social Media Text

Neelakshi Sarma, Sanasam Ranbir Singh, Diganta Goswami
2018 International Conference on Asian Language Processing (IALP)
The content posted on social media platforms today is characterized by code-mixing, phonetic typing, lexical borrowing, and neologism, making word-level language identification an important prerequisite for various natural language processing applications. Key result: from various experimental observations, it is evident that global semantic similarities help identify borrowed words, and local contextual similarity helps resolve words that are valid in multiple languages.


SwitchNet: Learning to switch for word-level language identification in code-mixed social media text
Evaluation over a corpus of transliterated Facebook comments shows that the proposed approach outperforms its baseline counterparts: classification based on contextual information, classification based on the word in isolation, and an ensemble of the two classifiers.
Sistem Identifikasi Bahasa Jawa dan Bahasa Indonesia Dokumen Teks Berbasis N-Gram Karakter (Character N-Gram Based Identification System for Javanese and Indonesian Text Documents)
Abstract: Language identification is a process that attempts to automatically determine the language used in a discourse. Language identification systems (Sistem Identifikasi Bahasa, SIB) are basically divided into…


“ye word kis lang ka hai bhai?” Testing the Limits of Word level Language Identification
This study shows that word-level language identification is most likely to confuse languages that are linguistically related (e.g., Hindi and Gujarati, Czech and Slovak), for which special disambiguation techniques might be required.
Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text
A code-mixing index is introduced to evaluate the level of blending in the corpora and the performance of a system developed to separate multiple languages is described.
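The summary above does not give the formula for the index. As a minimal sketch, assuming the commonly cited per-utterance formulation CMI = 100 × (1 − max(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max(w_i) the token count of the dominant language, the computation might look like this. The function name and the `univ` tag for language-independent tokens are illustrative assumptions.

```python
from collections import Counter


def code_mixing_index(tags, neutral="univ"):
    """Return a code-mixing index in [0, 100) for one utterance.

    tags    -- one language tag per token, e.g. ["hi", "en", "univ"]
    neutral -- hypothetical tag for language-independent tokens
               (punctuation, emoticons, named entities, and so on)
    """
    n = len(tags)
    # Count only tokens that carry a language.
    counts = Counter(t for t in tags if t != neutral)
    u = n - sum(counts.values())
    if n == u:
        # No language-bearing tokens: treat the utterance as not mixed.
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))
```

Under this formulation a monolingual utterance scores 0, and the score grows as the dominant language accounts for a smaller share of the language-bearing tokens.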
Word-level Language Identification in Bi-lingual Code-switched Texts
This paper adopts a novel experimental model which considers the language and part-of-speech of adjoining words for word-level language identification of code-switched sentences and shows that the proposed model significantly increases the accuracy over existing approaches.
Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System
A CRF-based system for word-level language identification of code-mixed text that uses lexical, contextual, character n-gram, and special character features and can therefore easily be replicated across languages.
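To make the four feature families concrete, here is a hedged sketch of a per-token feature extractor of the kind a CRF tagger could consume. It is not the MSR India system's actual feature set: the feature names, the boundary markers `^` and `$`, the `<BOS>`/`<EOS>` placeholders, and the n-gram sizes are all assumptions chosen for illustration.

```python
def char_ngrams(word, n):
    """Return character n-grams of `word`, padded with boundary markers."""
    padded = f"^{word}$"
    return [padded[k:k + n] for k in range(len(padded) - n + 1)]


def token_features(tokens, i, ngram_sizes=(2, 3)):
    """Build a feature dict for tokens[i] (illustrative feature names)."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        # Lexical feature: the normalized word form itself.
        "lower": lower,
        # Special-character features.
        "is_digit": word.isdigit(),
        "has_special": any(not c.isalnum() for c in word),
        # Contextual features: the neighbouring words.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Character n-gram features capture language-specific spelling patterns.
    for n in ngram_sizes:
        for gram in char_ngrams(lower, n):
            feats[f"ng{n}:{gram}"] = True
    return feats
```

Calling `token_features` once per position yields one dict per token; a sequence of such dicts is the usual input shape for CRF toolkits, which then learn weights over these features and the label transitions.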
Code Mixing: A Challenge for Language Identification in the Language of Social Media
A new dataset is described, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi, and it is found that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
Microblog language identification: overcoming the limitations of short, unedited and idiomatic text
This work examines the language distribution of a million tweets, along with a temporal analysis, the usage of Twitter features across languages, and a correlation study between the classifications made and the geo-location and language metadata fields.
Word Level Language Identification in Online Multilingual Communication
This work tags the language of individual words using language models and dictionaries and achieves an accuracy of 98%.
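The summary does not say how the language models and dictionaries are combined. One plausible arrangement, sketched here purely as an assumption, is to trust a dictionary when exactly one language lists the word and otherwise fall back to a character bigram model with add-one smoothing. All function names, the smoothing constant, and the toy word lists are hypothetical.

```python
import math
from collections import Counter


def train_char_bigram_model(words):
    """Count character bigrams (with boundary markers) over a word list."""
    counts = Counter()
    for w in words:
        padded = f"^{w.lower()}$"
        counts.update(padded[k:k + 2] for k in range(len(padded) - 1))
    return counts, sum(counts.values())


def bigram_log_prob(word, model, vocab_size=1000):
    """Add-one smoothed log-probability of `word` under a bigram model.

    vocab_size is an assumed number of distinct bigrams for smoothing.
    """
    counts, total = model
    padded = f"^{word.lower()}$"
    return sum(
        math.log((counts[padded[k:k + 2]] + 1) / (total + vocab_size))
        for k in range(len(padded) - 1)
    )


def tag_word(word, dictionaries, models):
    """Tag one word: dictionary lookup first, language model as fallback."""
    hits = [lang for lang, vocab in dictionaries.items() if word.lower() in vocab]
    if len(hits) == 1:
        return hits[0]
    # Unknown or ambiguous word: let the character models decide.
    candidates = hits or list(models)
    return max(candidates, key=lambda lang: bigram_log_prob(word, models[lang]))
```

A usage example with toy data: build `dictionaries = {"en": {"school", "going"}, "hi": {"ghar", "nahi"}}`, train one model per language on the same word lists, and call `tag_word(word, dictionaries, models)` for each token. Real systems would train on much larger corpora and typically add sentence-level context.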
Boot-Strapping Language Identifiers for Short Colloquial Postings
This work thoroughly evaluates the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conducts a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language model profile size, and the number of languages tested.
Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets
This work presents an exploration of automatic NER on code-mixed data, compares its method with existing off-the-shelf NER tools for social media content, and finds that its system outperforms the best baseline by 33.18%.
Codeswitching language identification using Subword Information Enriched Word Vectors
  • M. Xia
  • Computer Science
  • 2016
A supervised machine learning model is developed that identifies languages in English-Spanish codeswitched tweets by combining subword-information-enriched word vectors with a linear-chain Conditional Random Field, and demonstrates that named entity recognition remains a challenge in codeswitched texts.