A Model of Text for Experimentation in the Social Sciences

  title={A Model of Text for Experimentation in the Social Sciences},
  author={Margaret E. Roberts and Brandon M Stewart and Edoardo M. Airoldi},
  journal={Journal of the American Statistical Association},
  pages={988 - 1003}
ABSTRACT Statistical models of text have become increasingly popular in statistics and computer science as a method of exploring large document collections. Social scientists often want to move beyond exploration, to measurement and experimentation, and make inference about social and political processes that drive discourse and content. In this article, we develop a model of text data that supports this type of substantive research. Our approach is to posit a hierarchical mixed membership… 
Keyword Assisted Topic Models
It is empirically demonstrated that providing topic models with a small number of keywords can substantially improve their performance, and that the proposed keyword assisted topic model (keyATM) provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than standard topic models.
Discovery of Treatments from Text Corpora
A new experimental design and statistical model are introduced to simultaneously discover treatments in a corpus and estimate the causal effects of these discovered treatments in a test set of new texts and survey respondents.
Cross-structural Factor-topic Model: Document Analysis with Sophisticated Covariates
A novel factor-topic model that enables researchers to analyze latent structure in both text and sophisticated document-level covariates collectively, and that also learns the underlying factorial structure from the covariates and the interactions between the two structures.
Temporal Topic Analysis with Endogenous and Exogenous Processes
A hierarchical Bayesian topic model is proposed that imposes a "group-correlated" hierarchical structure on the evolution of topics over time, incorporating both endogenous and exogenous processes, and it is shown that this model can be estimated with Markov chain Monte Carlo sampling methods.
Time-Dependent Topic Analysis with Endogenous and Exogenous Processes
A hierarchical Bayesian topic model is proposed that imposes a dynamic hierarchical structure on the evolution of topics, incorporating the effects of exogenous processes, and this model can be estimated with Markov chain Monte Carlo sampling methods.
How to Make Causal Inferences Using Texts
A conceptual framework for making causal inferences with discovered measures as a treatment or outcome is introduced. This framework enables researchers to discover high-dimensional textual interventions and estimate the ways that observed treatments affect text-based outcomes.
Exploring Topic-Metadata Relationships with the STM: A Bayesian Approach
Two improvements are proposed: first, OLS is replaced with the more appropriate Beta regression; second, a fully Bayesian approach is suggested instead of the current blending of frequentist and Bayesian methods.
Text-as-data methods for comparative policy analysis
Text-as-data approaches are becoming mainstream in political science. They allow researchers to conduct existing lines of research more efficiently and to uncover new phenomena that previously remained hidden.
Inferring Concepts from Topics: Towards Procedures for Validating Topics as Measures
Prior work evaluates whether word sets learned by a topic model appear semantically related, but does not validate that the model captures the substantive quantity implied by the researchers’ topic label; general tools to validate topics as measures are therefore provided.


Finding scientific topics
  • T. Griffiths, M. Steyvers
  • Computer Science
    Proceedings of the National Academy of Sciences of the United States of America
  • 2004
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model. The algorithm is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
This paper presents theorems elucidating the posterior contraction rates of the topics as the amount of data increases, and a thorough supporting empirical study using synthetic and real data sets, including news and web-based articles and tweet messages.
A Method of Automated Nonparametric Content Analysis for Social Science
This work develops a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly, and illustrates with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency.
Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model
It is shown that the joint learning scheme of the sparse mixed-effects model improves on other state-of-the-art generative and discriminative models on the region and time period identification tasks, with results that are quantitatively more accurate and qualitatively interesting.
A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases
A statistical model is introduced that attends to the structure of political rhetoric when measuring expressed priorities: statements are naturally organized by author. The model simultaneously estimates the topics in the texts and the attention political actors allocate to the estimated topics.
A correlated topic model of Science
The correlated topic model (CTM) is developed, where the topic proportions exhibit correlation via the logistic normal distribution, and its use as an exploratory tool for large document collections is demonstrated.
How to Analyze Political Attention with Minimal Assumptions and Costs
Previous methods of analyzing the substance of political attention have had to make several restrictive assumptions or been prohibitively costly when applied to large-scale political texts. Here, we
Correlated Topic Models
The correlated topic model (CTM) is developed, where the topic proportions exhibit correlation via the logistic normal distribution, and a mean-field variational inference algorithm is derived for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial.
Multinomial Inverse Regression for Text Analysis
A straightforward framework of sentiment-sufficient dimension reduction for text data is introduced and it is shown that logistic regression of phrase counts onto document annotations can be used to obtain low-dimensional document representations that are rich in sentiment information.
Probabilistic Topic Models
  • D. Blei
  • Computer Science
    IEEE Signal Processing Magazine
  • 2010
Surveying a suite of algorithms that offer a solution to managing large document archives suggests they are well-suited to handle large amounts of data.