• Corpus ID: 3894514

Graph-Based N-gram Language Identification on Short Texts

  title={Graph-Based N-gram Language Identification on Short Texts},
  author={Erik Tromp and Mykola Pechenizkiy},
Keywords: classification, n-gram

Abstract: Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for addressing this problem, but most of them assume relatively long and well-written texts. We propose a graph-based N-gram approach for LI called LIGA, which targets relatively short and ill-written texts. The results of our experimental study show that LIGA outperforms the state-of-the-art N-gram approach on LI of Twitter messages.
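The abstract describes LIGA only at a high level. As an illustration of the general idea, and not the authors' exact algorithm, the hedged sketch below builds a graph in which nodes are character trigrams with per-language counts and edges are consecutive-trigram transitions with per-language counts; classification sums the normalized weights that an input text matches. All function and variable names here are assumptions for illustration.

```python
from collections import defaultdict

def trigrams(text):
    """Overlapping character trigrams of a text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(labeled_texts):
    """Build per-language node (trigram) and edge (transition) counts."""
    nodes = defaultdict(lambda: defaultdict(int))   # trigram -> lang -> count
    edges = defaultdict(lambda: defaultdict(int))   # (tri, next_tri) -> lang -> count
    totals_n = defaultdict(int)                     # lang -> total node weight
    totals_e = defaultdict(int)                     # lang -> total edge weight
    for text, lang in labeled_texts:
        grams = trigrams(text)
        for g in grams:
            nodes[g][lang] += 1
            totals_n[lang] += 1
        for a, b in zip(grams, grams[1:]):
            edges[(a, b)][lang] += 1
            totals_e[lang] += 1
    return nodes, edges, totals_n, totals_e

def classify(text, model):
    """Sum normalized node and edge weights matched by the input;
    return the highest-scoring language (None if nothing matches)."""
    nodes, edges, totals_n, totals_e = model
    scores = defaultdict(float)
    grams = trigrams(text)
    for g in grams:
        for lang, c in nodes.get(g, {}).items():
            scores[lang] += c / totals_n[lang]
    for a, b in zip(grams, grams[1:]):
        for lang, c in edges.get((a, b), {}).items():
            scores[lang] += c / totals_e[lang]
    return max(scores, key=scores.get) if scores else None
```

Normalizing by each language's total weight keeps languages with more training text from dominating the score, which matters for the short inputs the paper targets.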


Language identification in multilingual, short and noisy texts using common N-grams
This paper explores and discusses LID in short and noisy messages written in similar languages, which is a non-trivial task, especially for very related languages, and applies a novel distance-based classification method, Common N-Grams (CNG).
Graph-based Semi-supervised Learning for Text Classification
It is found that graph-based semi-supervised learning outperforms bag-of-words semi-supervised learning but not bag-of-words supervised learning in 20-class text categorization.
An Emergent Approach to Text Analysis Based on a Connectionist Model and the Web
The method builds a connectionist structure of relationships between word n-grams and provides a representation of the sentence that allows the least prominent usage-based relational patterns to emerge, helping to easily find badly written and unpopular text.
Language Identification from Text Documents
This study engaged these two emerging fields to come up with a robust language identifier on demand, namely the Stanford Language Identification Engine (SLIDE), which achieved 95.12% accuracy on the Discriminating between Similar Languages (DSL) Shared Task 2015 dataset, beating the maximum reported accuracy.
Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
This work presents a novel unsupervised word-level language detection technique for code-switched text for an arbitrarily large number of languages, which does not require any manually annotated training data.
Language Identification for Creating Language-Specific Twitter Collections
This work annotates and releases a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts; it is the first publicly available collection of LID-annotated tweets in non-Latin scripts and should become a standard evaluation set for LID systems.
Fast and Accurate Language Detection in Short Texts using Contextual Entropy
The results show that the language of the text, in the challenging case of short texts, can be accurately identified, matching state of the art approaches reported in the literature.
Automatic processing of code-mixed social media content
This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a word-level language identifier and a POS tagger for such code-mixed content, and finds that a multi-task learning approach is better than performing each task (e.g. language identification and POS tagging) individually using a neural approach.
Boot-Strapping Language Identifiers for Short Colloquial Postings
This work thoroughly evaluates the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conducts a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language-model profile size, and the number of languages tested.
Automatic Language Identification in Texts: A Survey
A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.


A Comparative Study on Language Identification Methods
This work presents the evaluation results and discusses the importance of a dynamic value for the out-of-place measure and the Ad-Hoc Ranking classification method.
Language Recognition for Mono- and Multi-lingual Documents
The monolingual algorithm, which allows for segmenting a multilingual document into single language chunks and identifying the languages of those chunks, significantly outperforms commonly used language recognition algorithms.
N-gram-based text categorization
An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
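The N-gram categorization method summarized above is commonly associated with a rank-order "out-of-place" statistic: each language is profiled by its most frequent character n-grams ranked by frequency, and a document is assigned the language whose profile minimizes the total rank displacement. Below is a minimal sketch under that reading; the names and the fixed penalty for missing n-grams are illustrative assumptions, not the paper's exact parameters.

```python
from collections import Counter

def profile(text, n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    ranked = [g for g, _ in counts.most_common(top)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank displacements; missing n-grams incur a fixed penalty."""
    dist = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            dist += abs(rank - lang_profile[gram])
        else:
            dist += max_penalty
    return dist
```

Because only ranks are compared, the measure is tolerant of the textual errors the summary mentions: a few misspelled words perturb n-gram frequencies slightly without reordering the top of the profile much.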
Linguini: language identification for multilingual documents
  • J. Prager
  • Computer Science
    Proceedings of the 32nd Annual Hawaii International Conference on System Sciences (HICSS-32), 1999. Abstracts and CD-ROM of Full Papers
  • 1999
Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy, and can be applied in subject categorization systems to distinguish cases where a document recommended for two or more categories belongs strongly to all of them from cases where it really belongs to none.
Language Identification from Text Using N-gram Based Cumulative Frequency Addition
The preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams are described, which is simpler than the conventional Naive Bayesian classification method but performs similarly in speed overall and better in accuracy on short input strings.
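Cumulative frequency addition, as the title suggests, scores each language by adding up the normalized frequencies of the input's n-grams in that language's table, rather than multiplying probabilities as a Naive Bayesian classifier would. A minimal sketch under that reading, with illustrative names:

```python
from collections import Counter

def ngram_freqs(text, n=3):
    """Normalized character n-gram frequencies of a training text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cfa_classify(text, lang_freqs, n=3):
    """Add the normalized frequencies of the input's n-grams per language;
    the highest cumulative sum wins."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    scores = {lang: sum(freqs.get(g, 0.0) for g in grams)
              for lang, freqs in lang_freqs.items()}
    return max(scores, key=scores.get)
```

Addition avoids the numerical underflow and zero-probability handling that multiplication requires, which is one plausible reason the summary reports better accuracy on short input strings.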
Evaluation of Language Identification Methods
Three freely available language identification programs are tested and evaluated; the paper explains how they work and comments on their accuracy.
Using compression-based language models for text categorization
Two approaches to compression-based categorization are presented: one based on ranking by document cross entropy (average bits per coded symbol) with respect to a category model, and the other based on the difference in document cross entropy between the category model and the complement-of-category model.
Extension of Zipf's Law to Word and Character N-grams for English and Chinese
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks, but when single words or characters are combined together with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf's law approximately.
Statistical Identification of Language
Multilingual sentiment analysis on social media. Master's thesis
  • 2011