Learn More
Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately , typical dimensionality reduction methods for text, such as latent Dirichlet allocation , often produce low-dimensional sub-spaces(More)
Topic models are a useful tool for analyzing large text collections, but have previously been applied in only monolingual, or at most bilingual, contexts. Meanwhile , massive collections of interlinked documents in dozens of languages, such as Wikipedia, are now widely available, calling for tools that can characterize content in many languages. We(More)
A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean method and empirical likelihood method. In this paper, we(More)
Some models of textual corpora employ text generation methods involving <i>n</i>-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both(More)
With the widespread proliferation of social media tools such as Facebook and Twitter, the CSCW community has seen a growing interest among researchers to turn to records of social behavior from blogs, social media, and social networking sites, to study human social behavior. This nascent area, that has begun to be referred to in various research circles as(More)
Deep belief networks are a powerful way to model complex probability distributions. However, it is difficult to learn the structure of a belief network, particularly one with hidden units. The Indian buffet process has been used as a nonparametric Bayesian prior on the structure of a directed belief network with a single infinitely wide hidden layer. Here,(More)
The task of assigning label sequences to a set of observation sequences arises in many fields, including bioinformatics, computational linguistics and speech recognition [6, 9, 12]. For example, consider the natural language processing task of labeling the words in a sentence with their corresponding part-of-speech (POS) tags. In this task, each word is(More)
2008 2 Declaration I hereby declare that my dissertation, entitled " Structured Topic Models for Language " , is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other university. No part of my dissertation has already been, or is concurrently being, submitted for any degree, diploma, or other(More)
Computational social science is an emerging research area at the intersection of computer science, statistics, and the social sciences, in which novel computational methods are used to answer questions about society. The field is inherently collaborative: social scientists provide vital context and insight into pertinent research questions, data sources,(More)