Navigating the Local Modes of Big Data: The Case of Topic Models

Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley
In: Computational Social Science
INTRODUCTION

Each day humans generate massive volumes of data in a variety of different forms (Lazer et al., 2009). For example, digitized texts provide a rich source of political content through standard media sources such as newspapers, as well as newer forms of political discourse such as tweets and blog posts. In this chapter we analyze a corpus of 13,246 posts that were written for six political blogs during the course of the 2008 U.S. presidential election. But this is just one small…


Exploring Thematic Diversity In News Coverage And Social Media Activity Of Political Candidates Using Unsupervised Machine Learning
This study systematically explores the relationship between electoral success of political candidates and the volume and tone of their news coverage and social media activity, and the independent (or dependent) nature of these media features.
A Model of Text for Experimentation in the Social Sciences
A hierarchical mixed membership model for analyzing the topical content of documents, in which mixing weights are parameterized by observed covariates, is posited, enabling researchers to introduce elements of the experimental design that informed document collection into the model, within a generally applicable framework.
A Robust Latent Dirichlet Allocation Approach for the Study of Political Text
This work proposes an approach to using topic model results for hypothesis testing that incorporates information from multiple specifications, and illustrates this robust approach by replicating an influential political science study.
A Full-Cycle Methodology for News Topic Modeling and User Feedback Research
A full-cycle methodology for online news analysis: from choosing the optimal topic number to the extraction of stable topics and analysis of TM results, illustrated with an analysis of online news stream of 164,426 messages formed by twelve national TV channels during a one-year period in a leading Russian OSN.
Political Opinion Formation as Epistemic Practice: The Hashtag Assemblage of #metwo
The article contributes to the literature on the political use of hashtags. We argue that hashtag assemblages could be understood in the tradition of representing public opinion through datafication
Human computation scaling for measuring meaningful latent traits in political texts
An innovative “human computation” method for encoding political texts that preserves much of the reliability of automated methods while leveraging the superior ability of humans to read and understand natural language is developed and validated.
Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges
This article first describes the four stages of a typical text-as-data project, then reviews recent political science applications and explores one important methodological challenge—topic model instability—in greater detail.
Towards Cultural-Scale Models of Full Text
This study tests the sensitivity of topic models to the sampling process by taking random samples of books in the Hathi Trust Digital Library within different Library of Congress Classification (LCC) areas, and finds that sample models with a large sample size typically have an alignment distance that falls in the range of the alignment distance between spanning models.
India nudges to contain COVID-19 pandemic: A reactive public policy analysis using machine-learning based topic modelling
An investigation of how the government formed reactive policies to fight coronavirus across its policy sectors found that nudges from the Prime Minister of India were critical in creating a herd effect on lockdown and social distancing norms across the nation.
Commenting on poverty online: A corpus-assisted discourse study of the Suomi24 forum
This paper brings new insight to poverty and social exclusion through an analysis of how poverty-related issues are commented on in the largest online discussion forum in Finland: Suomi24


A Method of Automated Nonparametric Content Analysis for Social Science
This work develops a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly, and illustrates with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency.
Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
Politics and political conflict often occur in the written and spoken word. Scholars have long recognized this, but the massive costs of analyzing even moderately sized collections of texts have
Scaling Politically Meaningful Dimensions Using Texts and Votes
A new approach to using sources of metadata about votes to estimate the degree to which those votes are about common issues, using latent Dirichlet allocation to discover the extent to which different issues were at stake in different cases and estimating justice preferences within each of those issues is proposed.
A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases
A statistical model is introduced that attends to the structure of political rhetoric when measuring expressed priorities: statements are naturally organized by author to simultaneously estimate the topics in the texts, as well as the attention political actors allocate to the estimated topics.
Learning to Extract International Relations from Political Context
A new probabilistic model for extracting events between major political actors from news corpora by bringing together familiar components in natural language processing with contextual political information— temporal and dyad dependence—to infer latent event classes is described.
How to Analyze Political Attention with Minimal Assumptions and Costs
Previous methods of analyzing the substance of political attention have had to make several restrictive assumptions or been prohibitively costly when applied to large-scale political texts. Here, we
Learning Topic Models -- Going beyond SVD
This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative, and gives the first polynomial-time algorithm for learning topic models without the above two limitations.
Finding scientific topics
T. Griffiths, M. Steyvers · Computer Science · Proceedings of the National Academy of Sciences of the United States of America · 2004
A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Optimizing Semantic Coherence in Topic Models
A novel statistical topic model based on an automated evaluation metric is introduced that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
High-Dimensional Methods and Inference on Structural and Treatment Effects
With scanner datasets that record transaction-level data for households across a wide range of products, or text data where counts of words in documents may be treated as variables, researchers are faced with a large set of potential variables formed by different ways of interacting and transforming the underlying variables.