Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks
This work explores the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web, and proposes a technique for bias correction that significantly improves annotation quality on two tasks.
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
A tagset is developed, data is annotated, features are developed, and results nearing 90% accuracy are reported on the problem of part-of-speech tagging for English data from the popular micro-blogging service Twitter.
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
We connect measures of public opinion measured from polls with sentiment measured from text. We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and ...
Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters
This work systematically evaluates the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy on Twitter and achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks.
A Latent Variable Model for Geographic Lexical Variation
A multi-level generative model that reasons jointly about latent topics and geographical regions is presented, which recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency.
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
This work presents a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter, proposes a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and verifies that this language follows well-known AAE linguistic phenomena.
Censorship and deletion practices in Chinese social media
This work presents the first large-scale analysis of political content censorship in social media, i.e., the active deletion of messages published by individuals, and uncovers a set of politically sensitive terms whose presence in a message leads to anomalously higher rates of deletion.
Learning Latent Personas of Film Characters
We present two latent variable models for learning character types, or personas, in film, in which a persona is defined as a set of mixtures over latent lexical classes. These lexical classes capture ...
TweetMotif: Exploratory Search and Topic Summarization for Twitter
This work presents TweetMotif, an exploratory search application for Twitter that groups messages by frequent significant terms (a result set’s subtopics), which facilitate navigation and drilldown through a faceted search interface.
Diffusion of Lexical Change in Social Media
Using a latent vector autoregressive model to aggregate across thousands of words, this work identifies high-level patterns in the diffusion of linguistic change over the United States and offers support for prior arguments that focus on geographical proximity and population size.