• Corpus ID: 33148539

Improving the utility of social media with Natural Language Processing

  • Bo Han
  • Published 2014
  • Computer Science
Social media has been an attractive target for many natural language processing (NLP) tasks and applications in recent years. However, the unprecedented volume of data and the non-standard language register cause problems for off-the-shelf NLP tools. This thesis investigates the broad question of how NLP-based text processing can improve the utility (i.e., the effectiveness and efficiency) of social media data. In particular, text normalisation and geolocation prediction are closely examined in… 
Multilingual Sequence Labeling Approach to solve Lexical Normalization
A sequence labeling approach to lexical normalization, combined with a word-alignment technique, is presented; the work highlights the effect of additional training data and of using a language model pre-trained on multiple languages rather than on a single language.
Predicting real estate market trends and value using pre-processing and sentiment text mining analysis
The main aim of text mining is to convert a large corpus of text into numerical form, applying mining techniques to extract meaningful knowledge by identifying and exploring interesting patterns in text sources.
Lexical Normalization for Code-switched Data and its Effect on POS Tagging
This paper proposes three normalization models specifically designed to handle code-switched data which are evaluated for two language pairs: Indonesian-English and Turkish-German, and introduces novel normalization layers and their corresponding language ID and POS tags for the dataset.
Transferring Informal Text in Arabic as Low Resource Languages: State-of-the-Art and Future Research Directions
This work focuses on Arabic and Arabic dialects as low-resource languages for transferring non-standard text using normalization and translation approaches, a setting hampered by the lack of sufficient parallel datasets.
Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions
This paper presents the Wikipedia Cultural Diversity dataset, a dataset that contains a classification of the articles that represent its associated cultural context, i.e. all concepts and entities related to the language and to the territories where it is spoken.
A Pragmatic Guide to Geoparsing Evaluation: Toponyms, Named Entity Recognition and Pragmatics
This manuscript introduces a new framework describing the task, metrics and data used to compare state-of-the-art systems in geoparsing and proposes a fine-grained Pragmatic Taxonomy of Toponyms with implications for Named Entity Recognition (NER) and beyond.
Détection des mots non-standards dans les tweets avec des réseaux de neurones (Detecting non-standard words in tweets with neural networks)
Detecting the words that need correction is the preliminary step in normalising non-standard texts such as tweets.
A pragmatic guide to geoparsing evaluation
A new framework describing the task, metrics and data used to compare state-of-the-art systems and proposing a fine-grained Pragmatic Taxonomy of Toponyms with implications for Named Entity Recognition (NER) and beyond is introduced.


A Broad-Coverage Normalization System for Social Media Language
A cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity is proposed.
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation
A novel beam-search decoder is proposed to effectively integrate various normalization operations and shows statistically significant improvements over two strong baselines in both normalization and translation tasks, for both Chinese and English.
Named Entity Recognition in Tweets: An Experimental Study
The novel T-NER system doubles the F1 score of the Stanford NER system, leveraging the redundancy inherent in tweets and using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision.
Syntactic Normalization of Twitter Messages
This paper describes a novel system which normalizes Twitter posts, converting them into a more standard form of English, so that standard machine translation (MT) and natural language processing (NLP) techniques can be more easily applied to them.
Adaptive Parser-Centric Text Normalization
This paper takes a parser-centric view of normalization that aims to convert raw informal text into grammatically correct text, and demonstrates that this approach outperforms not only the state-of-the-art word-to-word normalization techniques, but also manual word-to-word annotations.
Using paraphrases for improving first story detection in news and Twitter
A novel way of integrating paraphrases with locality sensitive hashing (LSH) is shown in order to obtain an efficient FSD system that can scale to very large datasets and achieves state-of-the-art results on the first story detection task.
Normalizing Microtext
This work proposes a normalization approach based on the source channel model that incorporates four factors: an orthographic factor, a phonetic factor, a contextual factor, and acronym expansion. The approach normalizes Twitter messages reasonably well and outperforms existing algorithms on a public SMS data set.
Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters
This work systematically evaluates the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy on Twitter and achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks.
The where in the tweet
This paper attempts to predict the POI tag of a tweet based on its textual content and time of posting, and uses web pages retrieved by search engines as an additional source of evidence to tackle the sparsity of tweets tagged with POIs.
Dude, srsly?: The Surprisingly Formal Nature of Twitter's Language
Twitter’s language is surprisingly more conservative and less informal than that of SMS and online chat, and Twitter users appear to be developing linguistically unique styles.