A correlated topic model of Science

Abstract

Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990–1999, a data set that comprises 57M words. The CTM gives a better fit of the data than LDA, and we demonstrate its use as an exploratory tool of large document collections.

1. Introduction

Large collections of documents are readily available on-line and widely accessed by diverse communities. As a notable example, scholarly articles are increasingly published in electronic form, and historical archives are being scanned and made accessible. The not-for-profit organization JSTOR (www.jstor.org) is currently one of the leading providers of journals to the scholarly community. These archives are created by scanning old journals and running an optical character recognizer over the pages. JSTOR provides the original scans on-line, and uses their noisy version of …
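The modeling difference described in the abstract, replacing the Dirichlet prior on the topic proportions with a logistic normal whose covariance can express correlation between topics, can be illustrated as a generative sketch. The code below is a minimal illustration under assumed names (K, V, doc_len, mu, Sigma, beta) and NumPy sampling routines; it is not the paper's implementation or notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K topics, a vocabulary of V words, one document of doc_len words.
K, V, doc_len = 10, 5000, 200

# Logistic-normal parameters for the topic proportions; nonzero off-diagonal
# entries of Sigma are what let the CTM capture correlation between topics.
mu = np.zeros(K)
Sigma = np.eye(K)

# Each topic is a distribution over the vocabulary (drawn here from a symmetric
# Dirichlet purely for illustration).
beta = rng.dirichlet(np.ones(V), size=K)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# CTM: topic proportions via the logistic normal (Gaussian draw mapped to the simplex).
eta = rng.multivariate_normal(mu, Sigma)
theta_ctm = softmax(eta)

# LDA, for contrast: topic proportions from a Dirichlet, which has no mechanism
# for modeling correlation among the K components.
theta_lda = rng.dirichlet(np.ones(K))

# Generate one document under the CTM proportions: a topic per word,
# then a word from that topic's distribution over the vocabulary.
z = rng.choice(K, size=doc_len, p=theta_ctm)
words = np.array([rng.choice(V, p=beta[k]) for k in z])
```

Because Sigma may have nonzero off-diagonal entries, topics that tend to co-occur (e.g., genetics and disease) can receive jointly high proportions in a document, which a single Dirichlet draw cannot express; this is also why the logistic normal is not conjugate to the multinomial and motivates the variational inference algorithm mentioned in the abstract.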
