Jacob Eisenstein

Learn More
We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related(More)
The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic(More)
This paper describes a novel Bayesian approach to unsupervised topic segmentation. Unsupervised systems for this task are driven by lexical cohesion: the tendency of wellformed segments to induce a compact and consistent lexical distribution. We show that lexical cohesion can be placed in a Bayesian context by modeling the words in each topic segment as(More)
Model-based interface development systems have not been able to progress beyond producing narrowly focused interface designs of restricted applicability. We identify a level-of-abstraction mismatch in interface models, which we call the mapping problem, as the cause of the limitations in the usefulness of model-based systems. We propose a general(More)
We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite `1,∞ regularizer, we obtain structured sparsity,(More)
The rise of social media has brought computational linguistics in ever-closer contact with bad language: text that defies our expectations about vocabulary, spelling, and syntax. This paper surveys the landscape of bad language, and offers a critical review of the NLP community’s response, which has largely followed two paths: normalization and domain(More)