Finding scientific topics

  • Thomas L. Griffiths, Mark Steyvers
  • Published 6 April 2004
  • Computer Science
  • Proceedings of the National Academy of Sciences of the United States of America
  • Pages 5228-5235
A first step in identifying the content of a document is determining which topics that document addresses. […] 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics…
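The generative process described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and the fixed document length are all hypothetical choices made for the example. The abstract only says "a Markov chain Monte Carlo algorithm"; the inference function below assumes a collapsed Gibbs sampler, one common MCMC scheme for this kind of model.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Generate documents: pick a topic mix per document, then a topic per word."""
    rng = random.Random(seed)
    # One distribution over the vocabulary for each topic (assumed Dirichlet prior).
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Distribution over topics for this document.
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            topic = rng.choices(range(n_topics), weights=theta)[0]
            word = rng.choices(range(vocab_size), weights=phi[topic])[0]
            words.append(word)
        docs.append(words)
    return docs


def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters=50, seed=0):
    """Collapsed Gibbs sampling over per-word topic assignments (assumed scheme)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # total words per topic
    z = []
    # Random initial assignment of every word token to a topic.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(n_topics)
            assignments.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(assignments)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    * (n_dk[d][t] + alpha)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the new assignment and restore the counts.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw
```

Calling `generate_corpus(5, 20, 3, 10, 0.5, 0.1)` yields five synthetic documents of twenty word indices each, and passing them to `gibbs_lda` returns the topic assignments together with the count tables from which topic and document distributions could be estimated. Selecting the number of topics, which the abstract handles by Bayesian model selection, is outside the scope of this sketch.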

Probabilistic author-topic models for information discovery
The methodology is applied to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and a model with 300 topics is learned using a Markov chain Monte Carlo algorithm.
A network approach to topic models
A new approach to topic models finds topics through community detection in word-document networks by adapting existing community-detection methods using a stochastic block model with nonparametric priors, and shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
The Author-Topic Model for Authors and Documents
The author-topic model is introduced, a generative model for documents that extends Latent Dirichlet Allocation to include authorship information, and applications to computing similarity between authors and entropy of author output are demonstrated.
Sequential Latent Dirichlet Allocation: Discover Underlying Topic Structures within a Document
By taking into account the sequential structure within a document, the SeqLDA model achieves higher fidelity than LDA in terms of perplexity (a standard measure of dictionary-based compressibility) and yields a nicer sequential topic structure than LDA.
Interpreting document collections with topic models
This thesis examines the problem of identifying incoherent topics, proposes novel methods for efficiently identifying semantically related topics that can be used for topic recommendation, and proposes approaches that provide textual or image labels to assist in topic interpretability.
Detecting research topics via the correlation between graphs and texts
This paper presents a unique approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph where the nodes are limited to the documents containing the term.
Learning author-topic models from text corpora
The interpretation of the results discovered by the system is discussed, including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors.
Structured Topic Models for Language
This thesis introduces new methods for statistically modelling text using topic models that combine latent topics with information about document structure, ranging from local sentence structure to inter-document relationships.
A correlated topic model of Science
The correlated topic model (CTM) is developed, where the topic proportions exhibit correlation via the logistic normal distribution, and its use as an exploratory tool for large document collections is demonstrated.
Extracting Representative Words of a Topic Determined by Latent Dirichlet Allocation
Experimental results show that the proposed method to estimate representative words of each topic from an LDA result provides better information for interpreting a topic than LDA does.


Unsupervised Learning by Probabilistic Latent Semantic Analysis
This paper proposes to make use of a temperature controlled version of the Expectation Maximization algorithm for model fitting, which has shown excellent performance in practice, and results in a more principled approach with a solid foundation in statistical inference.
Monte Carlo Strategies in Scientific Computing
The strength of this book is in bringing together advanced Monte Carlo methods developed in many disciplines, including the Ising model, molecular structure simulation, bioinformatics, target tracking, hypothesis testing for astronomical observations, Bayesian inference of multilevel models, and missing-data problems.
Expectation-Propagation for the Generative Aspect Model
This paper demonstrates that the simple variational methods of Blei et al. (2001) can lead to inaccurate inferences and biased learning for the generative aspect model, and develops an alternative approach that leads to higher accuracy at comparable cost.
Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images
  • S. Geman, D. Geman
  • Physics
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1984
The analogy between images and statistical mechanics systems is made and the analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, creating a highly parallel ``relaxation'' algorithm for MAP estimation.
Monte Carlo Methods in Statistical Physics
This book provides an introduction to Monte Carlo simulations in classical statistical physics and is aimed both at students beginning work in the field and at more experienced researchers who wish
Markov Chain Monte Carlo in Practice
The Markov Chain Monte Carlo Implementation Results Summary and Discussion MEDICAL MONITORING Introduction Modelling Medical Monitoring Computing Posterior Distributions Forecasting Model Criticism Illustrative Application Discussion MCMC for NONLINEAR HIERARCHICAL MODELS.
Foundations of statistical natural language processing
This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Fundamental theorem of natural selection under gene-culture transmission.
  • C. Findlay
  • Biology
    Proceedings of the National Academy of Sciences of the United States of America
  • 1991
It is shown that cultural transmission has several important implications for the evolution of population fitness, most notably that there is a time lag in the response to selection such that the future evolution depends on the past selection history of the population.
In Advances in Neural Information Processing Systems
Bill Baird { Publications References 1] B. Baird. Bifurcation analysis of oscillating neural network model of pattern recognition in the rabbit olfactory bulb. In D. 3] B. Baird. Bifurcation analysis
1997 IEEE Workshop on Automatic Speech Recognition and Understanding : proceedings
This workshop focuses on the recent progress and new ground-breaking paradigms of automatic speech recognition and understanding, with robust modeling as the main theme.