Microblog language identification: overcoming the limitations of short, unedited and idiomatic text

@article{Carter2013MicroblogLI,
  title={Microblog language identification: overcoming the limitations of short, unedited and idiomatic text},
  author={Simon Carter and Wouter Weerkamp and Manos Tsagkias},
  journal={Language Resources and Evaluation},
  year={2013},
  volume={47},
  pages={195--215}
}
Multilingual posts can potentially affect the outcomes of content analysis on microblog platforms. To this end, language identification can provide a monolingual set of content for analysis. We find the unedited and idiomatic language of microblogs to be challenging for state-of-the-art language identification methods. To account for this, we identify five microblog characteristics that can help in language identification: the language profile of the blogger (blogger), the content of an… 
TweetLID: a benchmark for tweet language identification
TLDR
This work describes the development of a benchmark to encourage further research in language identification, sets forth an evaluation framework suitable for the task, and makes a dataset of annotated tweets publicly available for research purposes.
Language Identification for Creating Language-Specific Twitter Collections
TLDR
This work annotates and releases a large collection of tweets in nine languages, focusing on confusable languages using the Cyrillic, Arabic, and Devanagari scripts. It is the first publicly available collection of LID-annotated tweets in non-Latin scripts and should become a standard evaluation set for LID systems.
Mining multilingual and multiscript Twitter data: unleashing the language and script barrier
TLDR
This work has developed a system that automatically identifies and classifies native tweets, irrespective of the script used, and found that the proposed framework gives better precision than the prevailing approaches.
Boot-Strapping Language Identifiers for Short Colloquial Postings
TLDR
This work thoroughly evaluates the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conducts a large-scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language model profile size and number of languages tested.
Language Identification for Social Media: Short Messages and Transliteration
TLDR
This work uses a previously trained general-purpose language identification model to semi-automatically label a large corpus of tweets in order to train a tweet-specific language identification model, and gives special attention to text written in transliterated Arabic and Russian.
Towards Normalising Konkani-English Code-Mixed Social Media Text
TLDR
This work, the first attempt at creating a linguistic resource for this language pair, develops a language identification and normalisation system for the Konkani-English language pair and describes a new dataset of thousands of Facebook posts that exhibit code mixing between Konkani and English.
Toward Accurate Social Media Language Identification: Combining Language Features with a Graphical Approach
  • K. Abainia
  • Computer Science
  • 2018 3rd International Conference on Pattern Analysis and Intelligent Systems (PAIS)
  • 2018
TLDR
This work addresses the benchmarked problem of LID of Twitter messages, where an effective approach (HAG) is presented to deal with the three difficulties of language identification: noisy and short texts, multilingual texts, and similar languages.
Language identification of multilingual posts from Twitter: a case study
TLDR
A method for handling multi-class and multi-label classification problems, based on the support vector machine formalism, is described for the language identification problem on Twitter, and a threshold-based strategy to favor classes with less data is proposed.
A Vectorization Approach to Language Identification of Social Media Short Texts
TLDR
A vectorization-based approach is proposed that exploits weighted n-gram statistical features and an improved Cascade Forest to conduct accurate language identification of social media short texts; it achieves better accuracy, precision and recall rates than state-of-the-art machine learning algorithms and off-the-shelf approaches.

References

Semi-Supervised Priors for Microblog Language Identification
TLDR
This paper explores the performance of a state-of-the-art n-gram-based language identifier, and introduces two semi-supervised priors to enhance performance at microblog post level.
Twitter power: Tweets as electronic word of mouth
TLDR
It is found that microblogging is an online tool for customer word-of-mouth communications, and the implications for corporations using microblogging as part of their overall marketing strategy are discussed.
Language identification of names with SVMs
TLDR
It is shown that an approach based on SVMs with n-gram counts as features performs much better than language models on the problem of language identification of names.
Do all birds tweet the same?: characterizing twitter around the world
TLDR
This paper presents a summary of a large-scale analysis of Twitter over an extended period of time and reports differences and similarities in terms of activity, sentiment, use of languages, and network structure. It is presented as the first on-line social network study of such characteristics.
Language Identification: The Long and the Short of the Matter
TLDR
It is demonstrated that the task becomes increasingly difficult as the authors increase the number of languages, reduce the amount of training data and reduce the length of documents, and it is shown that it is possible to perform language identification without having to perform explicit character encoding detection.
Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment
TLDR
It is found that the mere number of messages mentioning a party reflects the election result, and joint mentions of two parties are in line with real world political ties and coalitions.
A Comparison of Language Identification Approaches on Short, Query-Style Texts
TLDR
This work compares the performance of some typical approaches for language detection on very short, query-style texts and shows that an accuracy of more than 80% can be achieved even for single words; for slightly longer texts, the authors observed accuracy values close to 100%.
Incorporating Query Expansion and Quality Indicators in Searching Microblog Posts
TLDR
A language modeling approach tailored to microblogging characteristics, where redundancy-based IR methods cannot be used in a straightforward manner, is developed and a dynamic query expansion model for microblog post retrieval is proposed.
Language identification in web pages
TLDR
The language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure, and achieves very high accuracy in discriminating different languages on Web pages.
N-gram-based text categorization
TLDR
An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.