Learn More
We address the problem of part-of-speech tagging for English data from the popular micro-blogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related(More)
Generative models of text typically associate a multinomial with every class label or topic. Even in simple models this requires the estimation of thousands of parameters; in multi-faceted latent variable models, standard approaches require additional latent " switching " variables for every token, complicating inference. In this paper, we propose an(More)
The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as " sports " or " entertainment " are rendered differently in each geographic(More)
We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users. Prior quantitative work on gender often treats this social variable as a female/male binary; we argue for a more nuanced approach. By clustering Twitter users, we find a natural decomposition of the dataset into various(More)
The rise of social media has brought computational linguistics in ever-closer contact with bad language: text that defies our expectations about vocabulary, spelling, and syntax. This paper surveys the landscape of bad language , and offers a critical review of the NLP community's response, which has largely followed two paths: normalization and domain(More)
We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors' geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite 1,∞ regularizer, we obtain structured sparsity,(More)
Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new(More)