• Corpus ID: 3894514

Graph-Based N-gram Language Identification on Short Texts

  title={Graph-Based N-gram Language Identification on Short Texts},
  author={Erik Tromp and Mykola Pechenizkiy},
Abstract: Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for addressing this problem, but most of them assume relatively long and well-written texts. We propose a graph-based N-gram approach for LI called LIGA, which targets relatively short and ill-written texts. The results of our experimental study show that LIGA outperforms the state-of-the-art N-gram approach on LI of Twitter messages.
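To make the idea of a graph-based N-gram model concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract gives no algorithmic detail, so the choice of character trigrams as nodes, transitions between consecutive trigrams as edges, the normalization by per-language totals, and the names `TrigramGraph`, `add`, `scores`, and `identify` are all assumptions made for illustration.

```python
from collections import defaultdict


def trigrams(text):
    """Return the character trigrams of text in order of occurrence."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


class TrigramGraph:
    """Hypothetical graph model: nodes are trigrams, edges are trigram transitions."""

    def __init__(self):
        # node_w[trigram][lang] and edge_w[(tri_a, tri_b)][lang] hold raw counts.
        self.node_w = defaultdict(lambda: defaultdict(int))
        self.edge_w = defaultdict(lambda: defaultdict(int))
        # Per-language totals, used to normalize counts (an assumption).
        self.node_total = defaultdict(int)
        self.edge_total = defaultdict(int)

    def add(self, text, lang):
        """Add one labeled training text to the graph."""
        grams = trigrams(text.lower())
        for g in grams:
            self.node_w[g][lang] += 1
            self.node_total[lang] += 1
        for a, b in zip(grams, grams[1:]):
            self.edge_w[(a, b)][lang] += 1
            self.edge_total[lang] += 1

    def scores(self, text):
        """Sum normalized node and edge weights along the text's trigram path."""
        grams = trigrams(text.lower())
        result = defaultdict(float)
        for g in grams:
            for lang, w in self.node_w.get(g, {}).items():
                result[lang] += w / self.node_total[lang]
        for pair in zip(grams, grams[1:]):
            for lang, w in self.edge_w.get(pair, {}).items():
                result[lang] += w / self.edge_total[lang]
        return dict(result)

    def identify(self, text):
        """Return the best-scoring language, or None if nothing matched."""
        s = self.scores(text)
        return max(s, key=s.get) if s else None


graph = TrigramGraph()
graph.add("the quick brown fox jumps over the lazy dog", "en")
graph.add("de snelle bruine vos springt over de luie hond", "nl")
```

Under these assumptions, the edge weights capture the order in which trigrams follow each other, which is information a plain bag of N-grams discards and which may matter more when a text is only a few words long.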

Language identification in multilingual, short and noisy texts using common N-grams

This paper explores and discusses LID in short and noisy messages, a non-trivial task especially for closely related languages, and explores a novel distance-based classification method, Common N-Grams (CNG).

Graph-based Semi-supervised Learning for Text Classification

It is found that graph-based semi-supervised learning outperforms bag-of-words semi-supervised learning but not bag-of-words supervised learning in 20-class text categorization.

An Emergent Approach to Text Analysis Based on a Connectionist Model and the Web

The method builds a connectionist structure of relationships between word n-grams and provides a representation of the sentence that lets the least prominent usage-based relational patterns emerge, helping to find badly written and unpopular text.

Language Identification from Text Documents

This study engaged these two emerging fields to come up with a robust language identifier on demand, namely Stanford Language Identification Engine (SLIDE), and achieved 95.12% accuracy on the Discriminating between Similar Languages (DSL) Shared Task 2015 dataset, beating the maximum reported accuracy.

Language Identification for Creating Language-Specific Twitter Collections

This work annotates and releases a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts. It is the first publicly available collection of LID-annotated tweets in non-Latin scripts and should become a standard evaluation set for LID systems.

Automatic processing of code-mixed social media content

This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a word-level language identifier and a POS tagger for such code-mixed content, and finds that a multi-task learning approach works better than performing the individual tasks (e.g. language identification and POS tagging) separately using a neural approach.

Boot-Strapping Language Identifiers for Short Colloquial Postings

This work thoroughly evaluates the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conducts a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language model profile size, and number of languages tested.

Automatic Language Identification in Texts: A Survey

A unified notation is introduced for evaluation methods, applications, as well as off-the-shelf LI systems that do not require training by the end user, to propose future directions for research in LI.

TweetLID: a benchmark for tweet language identification

This work describes the development of a benchmark to encourage further research in language identification, sets forth an evaluation framework suitable for the task, and makes a dataset of annotated tweets publicly available for research purposes.



Language Recognition for Mono-and Multi-lingual Documents

The monolingual algorithm, which allows for segmenting a multilingual document into single language chunks and identifying the languages of those chunks, significantly outperforms commonly used language recognition algorithms.

N-gram-based text categorization

An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
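A rank-based N-gram profile comparison of the kind commonly associated with this approach can be sketched as follows. This is a simplified illustration, not the paper's code: the N-gram lengths (1 to 3), the profile size of 300, the maximum penalty for missing N-grams, and the function names `profile`, `out_of_place`, and `classify` are assumptions.

```python
from collections import Counter


def profile(text, max_n=3, size=300):
    """Map each of the `size` most frequent character N-grams to its rank (0 = most frequent)."""
    counts = Counter()
    padded = " " + text.lower() + " "  # pad so word boundaries form N-grams too
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; N-grams missing from the language profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


EN_TEXT = "the quick brown fox jumps over the lazy dog"
NL_TEXT = "de snelle bruine vos springt over de luie hond"
profiles = {"en": profile(EN_TEXT), "nl": profile(NL_TEXT)}
```

Because the comparison uses ranks of frequent N-grams rather than exact word matches, a single misspelling only perturbs a few N-grams, which is one way to read the claim of tolerance to textual errors.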

Linguini: language identification for multilingual documents

  • J. Prager
  • Computer Science
    Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences. 1999. HICSS-32. Abstracts and CD-ROM of Full Papers
  • 1999
Linguini could identify the language of documents as short as 5-10% of the size of average Web documents with 100% accuracy, and can be applied to subject categorization systems to distinguish between cases where, when the system recommends two or more categories, the document belongs strongly to all or really to none.

Language Identification from Text Using N-gram Based Cumulative Frequency Addition

The preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams are described, which is simpler than the conventional Naive Bayesian classification method but performs similarly in speed overall and better in accuracy on short input strings.

Evaluation of Language Identification Methods

Three freely available language identification programs are tested and evaluated, with an explanation of how they work and comments on their accuracy.

Using compression based language models for text categorization.

Two approaches to compression-based categorization are presented, one based on ranking by document cross entropy (average bits per coded symbol) with respect to a category model, and the other based on document cross entropy difference between category and complement of category models.
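The first approach, ranking by cross entropy in bits per symbol, can be illustrated with a small sketch. Note the substitution: instead of a real compression model, this example uses a character bigram model with add-one smoothing, purely as a stand-in. The class `BigramModel`, the function `rank_categories`, and the alphabet size of 256 are assumptions for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Character bigram model with add-one smoothing (a stand-in for a compression model)."""

    def __init__(self, text, alphabet_size=256):
        self.pair = Counter(zip(text, text[1:]))  # counts of (previous, current) characters
        self.ctx = Counter(text[:-1])             # counts of each character used as context
        self.v = alphabet_size

    def prob(self, prev, ch):
        """Smoothed estimate of P(ch | prev)."""
        return (self.pair[(prev, ch)] + 1) / (self.ctx[prev] + self.v)

    def cross_entropy(self, doc):
        """Average number of bits per predicted character of doc under this model."""
        pairs = list(zip(doc, doc[1:]))
        if not pairs:
            return 0.0
        bits = -sum(math.log2(self.prob(a, b)) for a, b in pairs)
        return bits / len(pairs)


def rank_categories(doc, models):
    """Return category names sorted from lowest to highest cross entropy."""
    return sorted(models, key=lambda name: models[name].cross_entropy(doc))


models = {
    "ab": BigramModel("abababab"),
    "xy": BigramModel("xyxyxyxy"),
}
```

A document is assigned to the category whose model would encode it in the fewest bits per symbol, so `rank_categories("abab", models)` places `"ab"` first.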

Statistical Identification of Language

Extension of Zipf's Law to Word and Character N-grams for English and Chinese

It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks, but when single words or characters are combined together with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf's law approximately.

A Comparative Study on Language Identification Methods

This work presents the evaluation results and discusses the importance of a dynamic value for the out-of-place measure and the Ad-Hoc Ranking classification method.

Multilingual sentiment analysis on social media. Master's thesis

  • 2011