The IUCL+ System: Word-Level Language Identification via Extended Markov Models

@inproceedings{King2014TheIS,
  title={The IUCL+ System: Word-Level Language Identification via Extended Markov Models},
  author={Levi King and Eric Baucom and Timur Gilmanov and Sandra K{\"u}bler and Daniel Whyatt and Wolfgang Maier and Paul Rodrigues},
  booktitle={CodeSwitch@EMNLP},
  year={2014}
}
We describe the IUCL+ system for the shared task of the First Workshop on Computational Approaches to Code Switching (Solorio et al., 2014), in which participants were challenged to label each word in Twitter texts as a named entity or one of two candidate languages. Our system combines character n-gram probabilities, lexical probabilities, word label transition probabilities, and existing named entity recognition tools within a Markov model framework that weights these components and assigns a…
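The abstract describes weighting several per-word scores (character n-grams, lexicon lookups, label transitions) inside a Markov model. A minimal sketch of that general idea is Viterbi decoding over weighted log-probabilities; the table-based scorers, label set, and weights below are illustrative stand-ins, not the authors' implementation (a real system would score a word's character n-grams rather than look the word up whole):

```python
import math

LABELS = ["en", "es", "ne"]  # illustrative label set; "ne" = named entity

def emission_scores(word, char_lm, lexicon, weights):
    """Weighted combination of two per-label log-probability components.

    char_lm and lexicon are toy word->log-prob tables standing in for a
    character n-gram model and a lexical-probability model.
    """
    scores = {}
    for lab in LABELS:
        char_lp = char_lm[lab].get(word, math.log(1e-6))  # smoothed backoff
        lex_lp = lexicon[lab].get(word, math.log(1e-6))
        scores[lab] = weights[0] * char_lp + weights[1] * lex_lp
    return scores

def viterbi(words, char_lm, lexicon, trans, weights):
    """Label each word with the best path under a first-order Markov model.

    trans maps (prev_label, label) pairs to transition log-probabilities.
    """
    V = [emission_scores(words[0], char_lm, lexicon, weights)]
    back = []
    for w in words[1:]:
        em = emission_scores(w, char_lm, lexicon, weights)
        col, bp = {}, {}
        for lab in LABELS:
            best_prev = max(LABELS, key=lambda p: V[-1][p] + trans[(p, lab)])
            col[lab] = V[-1][best_prev] + trans[(best_prev, lab)] + em[lab]
            bp[lab] = best_prev
        V.append(col)
        back.append(bp)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: V[-1][lab])]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

With uniform transitions this reduces to per-word classification; non-uniform transitions let the model penalize implausible language switches, which is the point of the Markov framework.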


An Unsupervised Word Level Language Identification of English and Kokborok Code-Mixed and Code-Switched Sentences
TLDR
This is the first language identification work dedicated to the low-resource English-Kokborok language pair; it combines a frequency-lexicon-based model, a character n-gram language model, and a language-dependent morphological dictionary-based model to correctly classify each word.
LILI: A Simple Language Independent Approach for Language Identification
TLDR
A generic Language Independent Framework for Linguistic Code Switch Point Detection that uses character-level 5-grams and word-level unigram language models to train a conditional random fields model for classifying input words into various languages.
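The feature side of such a CRF setup is straightforward to sketch: each token is mapped to a dictionary of character 5-grams plus the word itself, which a CRF library can then consume. The function below is an illustrative sketch of that feature extraction only (the padding markers and feature names are made up, not LILI's actual code):

```python
def token_features(word):
    """Map a token to a feature dict: the lowercased word plus all
    character 5-grams of the padded word (a hypothetical feature set)."""
    padded = f"<<{word}>>"  # pad so edge characters appear in full 5-grams
    feats = {f"w={word.lower()}": 1.0}
    for i in range(len(padded) - 4):
        feats[f"c5={padded[i:i+5]}"] = 1.0
    return feats
```

Dictionaries of this shape are the standard input format for CRF toolkits such as sklearn-crfsuite, where one list of feature dicts per sentence trains the sequence model.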
Overview for the Second Shared Task on Language Identification in Code-Switched Data
TLDR
The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs.
Overview for the First Shared Task on Language Identification in Code-Switched Data
TLDR
The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs.
AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic
TLDR
A hybrid approach for token- and sentence-level dialect identification in Arabic that identifies whether each token in a given sentence belongs to Modern Standard Arabic (MSA), Egyptian Dialectal Arabic (EDA), or some other class, and whether the whole sentence is mostly EDA or MSA.
A deep learning approach for the romanized tunisian dialect identification
TLDR
Segments and annotates a corpus extracted from social media and proposes a deep learning approach for identifying the Romanized user-generated Tunisian dialect on the social web.
Recurrent-Neural-Network for Language Detection on Twitter Code-Switching Corpus
TLDR
This paper trains recurrent neural networks on only raw features, using word embeddings to automatically learn meaningful representations, and outperforms the best SVM-based systems reported at the EMNLP'14 Code-Switching Workshop by 1% in accuracy, a 17% error rate reduction.
Code-Mixing: A Brief Survey
S. Thara, P. Poornachandran · 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2018
TLDR
A comprehensive study of code mixing across diverse fields of Natural Language Processing (NLP), including language identification, Part-of-Speech (POS) tagging, Named Entity Recognition (NER), polarity identification, and question answering.
Segregation of Code-Switching Sentences using Rule-Based Technique
TLDR
The ratio of word presence was used to segregate the sentences, and the rule-based technique achieved an accuracy of more than 87% for Malay-English code-switching (MY-EN-CS) sentences.
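A "ratio of word presence" rule can be sketched as counting what fraction of a sentence's words appear in each language's lexicon and labeling by those ratios. The tiny lexicons and the 0.2 threshold below are hypothetical choices for illustration, not the paper's actual rules:

```python
# Hypothetical toy lexicons; a real system would use full word lists.
EN = {"the", "is", "good"}
MY = {"saya", "makan", "nasi"}

def classify(sentence, threshold=0.2):
    """Label a sentence by the ratio of words found in each lexicon.

    If both ratios clear the threshold, the sentence mixes languages.
    """
    words = sentence.lower().split()
    en = sum(w in EN for w in words) / len(words)
    my = sum(w in MY for w in words) / len(words)
    if en >= threshold and my >= threshold:
        return "code-switched"
    return "en" if en > my else "my"
```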
An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed social Media Text in English and Roman Hindi
TLDR
A deep learning model based on BLSTM that predicts the language of each word in the sequence from the specific words that come before it; the word embedding model gives better accuracy than character embedding when evaluated on two test sets.

References

Overview for the Second Shared Task on Language Identification in Code-Switched Data
TLDR
The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs.
Overview for the First Shared Task on Language Identification in Code-Switched Data
TLDR
The evaluation showed that language identification at the token level is more difficult when the languages present are closely related, as in the case of MSA-DA, where the prediction performance was the lowest among all language pairs.
Word-level language identification in The Chymistry of Isaac Newton
TLDR
Introduces the task of word-based language identification in multilingual texts, in which every word must be classified by its language, and presents a novel method based on character n-grams combined with a weighting scheme that models the probability of language switches at different points in a sentence.
Word Level Language Identification in Online Multilingual Communication
TLDR
This work tags the language of individual words using language models and dictionaries and achieves an accuracy of 98%.
Arabic Named Entity Recognition using Conditional Random Fields
TLDR
A further attempt to enhance the accuracy of ANERsys by changing the probabilistic model from Maximum Entropy to Conditional Random Fields, which improved the results significantly.
Named Entity Recognition in Tweets: An Experimental Study
TLDR
The novel T-NER system doubles the F1 score compared with the Stanford NER system, leveraging the redundancy inherent in tweets and using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision.
Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text
TLDR
A generalized approach to language identification of on-line text based on techniques of cryptanalysis is outlined, and the results are promising.
Accurate Language Identification of Twitter Messages
TLDR
It is found that simple voting over three specific systems consistently outperforms any specific system, and achieves state-of-the-art accuracy on the task.
Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods
TLDR
The problem of labeling the languages of words in mixed-language documents is considered in a weakly supervised fashion, as a sequence labeling problem with only monolingual text samples for training data.
Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling
TLDR
By using simulated annealing in place of Viterbi decoding in sequence models such as HMMs, CMMs, and CRFs, it is possible to incorporate non-local structure while preserving tractable inference.