Corpus ID: 184114761

Detección de Idioma en Twitter

@inproceedings{AlmeidaCruz2014DeteccinDI,
  title={Detecci{\'o}n de Idioma en Twitter},
  author={Yudivi{\'a}n Almeida-Cruz and Suilan Est{\'e}vez-Velarde and Alejandro Piad-Morffis},
  year={2014}
}
This work presents an alternative for identifying languages on Twitter that requires neither training sets nor aggregated information. The alternative uses techniques based on trigram-recognition and small-words algorithms. These algorithms are evaluated both on their own and in a composite model. The work also analyzes how pre-processing the tweets affects the precision of language identification. Finally…
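The general idea of combining trigram matching with small-word matching can be illustrated with a minimal sketch. This is not the authors' implementation: the word lists, trigram lists, pre-processing rules, the `weight` parameter, and all function names below are hypothetical, chosen only to show how two such scores might be composed without a training corpus.

```python
import re

# Hypothetical, hand-picked resources for illustration only;
# they are NOT taken from the paper.
SMALL_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "es"},
}
TRIGRAMS = {
    "en": {"the", "ing", "and", "ion", " th", "he ", "nd ", "ed "},
    "es": {"que", "ent", "ció", " de", "de ", "os ", "la ", "ión"},
}


def preprocess(tweet):
    """Drop URLs, @mentions and '#' signs, collapse whitespace, lowercase."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet)
    tweet = tweet.replace("#", " ")
    return re.sub(r"\s+", " ", tweet).strip().lower()


def small_word_score(text, lang):
    """Fraction of words that are known short function words of `lang`."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    return sum(w in SMALL_WORDS[lang] for w in words) / len(words)


def trigram_score(text, lang):
    """Fraction of character trigrams that are known trigrams of `lang`."""
    padded = f" {text} "
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    if not grams:
        return 0.0
    return sum(g in TRIGRAMS[lang] for g in grams) / len(grams)


def identify(tweet, weight=0.5):
    """Combine both scores linearly; return None when nothing matches."""
    text = preprocess(tweet)
    scores = {
        lang: weight * trigram_score(text, lang)
        + (1 - weight) * small_word_score(text, lang)
        for lang in SMALL_WORDS
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling `identify("El gato de la casa es negro")` with these toy lists selects `"es"`. Setting `weight` to 1 or 0 reduces the sketch to the trigram or small-words score alone, which is one way to compare the individual algorithms against the combined model.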
