When 4 ≈ 10,000: The Power of Social Science Knowledge in Predictive Performance

  • S. McKay
  • Published 1 January 2019
  • Computer Science
  • Socius
Computer science has devised leading methods for predicting variables; can social science compete? The author sets out a social scientific approach to the Fragile Families Challenge. Key insights included new variables constructed according to theory (e.g., a measure of shame relating to hardship), lagged values of the target variables, using predicted values of certain outcomes to inform others, and validated scales rather than individual variables. The models were competitive: a four-variable… 
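The contrast between a handful of theory-driven predictors and thousands of raw variables can be illustrated with a toy sketch (synthetic data and scikit-learn, not the paper's actual variables, data, or models):

```python
# Toy illustration (synthetic data, NOT the paper's variables): a small,
# "theory-driven" predictor set versus a kitchen-sink model with 200 predictors.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 200))            # many candidate predictors
# The outcome depends on only a few of them, echoing the "4 vs 10,000" idea.
y = X[:, 0] + 0.8 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

small = LinearRegression().fit(X_tr[:, :3], y_tr)   # "expert-selected" variables
big = LinearRegression().fit(X_tr, y_tr)            # all 200 predictors

r2_small = r2_score(y_te, small.predict(X_te[:, :3]))
r2_big = r2_score(y_te, big.predict(X_te))
print(f"3-variable model R^2: {r2_small:.3f}  200-variable model R^2: {r2_big:.3f}")
```

On this synthetic data the parsimonious model matches or beats the full model out of sample, because the extra 197 coefficients only add estimation noise.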


Integrating Computer Prediction Methods in Social Science: A Comment on Hofman et al. (2021)
Machine learning and other computer-driven prediction models are one of the fastest growing trends in computational social science. These methods and approaches were developed in computer science and…
Successes and Struggles with Computational Reproducibility: Lessons from the Fragile Families Challenge
The authors describe their approach to enabling computational reproducibility for the 12 articles in this special issue of Socius about the Fragile Families Challenge, and draw on two tools commonly used by professional software engineers but not widely used by academic researchers: software containers and cloud computing.
Introduction to the Special Collection on the Fragile Families Challenge
The Fragile Families Challenge is a scientific mass collaboration designed to measure and understand the predictability of life trajectories. Participants in the Challenge created predictive models…
Special Collection: Fragile Families Challenge
  • 2020


Conditional variable importance for random forests
A new, conditional permutation scheme is developed for the computation of the variable importance measure that reflects the true impact of each predictor variable more reliably than the original marginal approach.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
This book is a valuable resource, both for the statistician needing an introduction to machine learning and related fields and for the computer scientist wishing to learn more about statistics; statisticians will especially appreciate that it is written in their own language.
Variable Importance Assessment in Regression: Linear Regression versus Random Forest
This article compares the two approaches (linear model on the one hand and two versions of random forests on the other hand) and finds both striking similarities and differences, some of which can be explained whereas others remain a challenge.
Bias in random forest variable importance measures: Illustrations, sources and a solution
An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
Logistic Regression in Rare Events Data
It is shown that more efficient sampling designs exist for making valid inferences, such as sampling all available events and a tiny fraction of nonevents, which enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables.
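A minimal sketch of that sampling design on simulated data, assuming the standard prior-correction formula for the intercept (subtract ln[((1−τ)/τ)(ȳ/(1−ȳ))], where τ is the population event rate and ȳ the sample event rate):

```python
# Sketch of case-control sampling for rare events: fit a logit on all events
# plus a small sample of nonevents, then apply the standard prior correction
# to the intercept. In practice tau comes from external knowledge of the
# population event rate; here we read it off the full simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-5 + 1.0 * x)))          # rare events: ~1% positive
y = rng.binomial(1, p)

tau = y.mean()                                  # population event rate
events = np.flatnonzero(y == 1)
nonevents = rng.choice(np.flatnonzero(y == 0), size=5 * len(events), replace=False)
idx = np.concatenate([events, nonevents])       # all events, tiny nonevent sample

clf = LogisticRegression(C=1e6).fit(x[idx].reshape(-1, 1), y[idx])
ybar = y[idx].mean()
b0_corrected = clf.intercept_[0] - np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))
print(f"raw intercept {clf.intercept_[0]:.2f}, corrected {b0_corrected:.2f} (true -5)")
```

The slope is consistent under this design; only the intercept needs the correction, which is what makes the ~99% saving on nonevent data collection possible.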
Replication Standards for Quantitative Social Science
The credibility of quantitative social science benefits from policies that increase confidence that results reported by one researcher can be verified by others. Concerns about replicability have…
An empirical analysis of journal policy effectiveness for computational reproducibility
This work evaluates the effectiveness of a journal policy requiring that the data and code necessary for reproducibility be made available postpublication by the authors upon request, and finds it to be an improvement over no policy but currently insufficient for reproducibility.
MICE: Multivariate Imputation by Chained Equations in R
The mice package adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing of imputed values, specialized pooling routines, model-selection tools, and diagnostic graphs.
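mice itself is an R package; a rough Python analogue of chained-equations imputation is scikit-learn's experimental IterativeImputer, sketched here on synthetic data:

```python
# Rough Python analogue of chained-equations imputation: scikit-learn's
# IterativeImputer (experimental API) regresses each incomplete column on the
# others in round-robin fashion, similar in spirit to mice in R.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = 2 * x1 + rng.normal(scale=0.5, size=n)    # x2 predictable from x1
X = np.column_stack([x1, x2])

X_miss = X.copy()
miss = rng.random(n) < 0.2                      # 20% of x2 missing
X_miss[miss, 1] = np.nan

X_imp = IterativeImputer(random_state=0).fit_transform(X_miss)
err = np.abs(X_imp[miss, 1] - X[miss, 1]).mean()
print(f"mean absolute imputation error: {err:.3f}")
```

Unlike mice proper, this returns a single completed matrix; multiple imputation would draw several completed datasets and pool the downstream estimates.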
The Ladder: A Reliable Leaderboard for Machine Learning Competitions
This work introduces a notion of leaderboard accuracy tailored to the format of a competition called the Ladder and demonstrates that it simultaneously supports strong theoretical guarantees in a fully adaptive model of estimation, withstands practical adversarial attacks, and achieves high utility on real submission files from an actual competition hosted by Kaggle.
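The core Ladder rule (release a new, coarsened score only when a submission beats the best so far by more than a step size) can be sketched in a few lines; this is a simplified reading of the mechanism, not the authors' exact parameterization:

```python
# Simplified sketch of the Ladder leaderboard rule: a submission's loss is
# reported only if it improves on the best so far by more than eta, and then
# only at eta-level precision; otherwise the previous best is repeated. This
# limits how much holdout information leaks through adaptive resubmission.
def ladder(losses, eta=0.01):
    """Return the leaderboard value shown after each submitted loss."""
    best = float("inf")
    board = []
    for loss in losses:
        if loss < best - eta:          # meaningful improvement only
            best = round(loss, 2)      # release a coarsened score
        board.append(best)
    return board

print(ladder([0.50, 0.495, 0.40, 0.41, 0.30]))
# -> [0.5, 0.5, 0.4, 0.4, 0.3]
```

The second submission (0.495) changes nothing on the board: improvements smaller than eta are suppressed, which is what defeats the overfit-to-the-leaderboard attacks described above.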
A review of methods for the assessment of prediction errors in conservation presence/absence models
Thirteen recommendations are made to enable the objective selection of an error assessment technique for ecological presence/absence models and a new approach to estimating prediction error, which is based on the spatial characteristics of the errors, is proposed.