Corpus ID: 35078977

Exploratory Data Analysis using Random Forests ∗

by Zachary Mark Jones and Fridolin Linder
Although the rise of "big data" has made machine learning algorithms more visible and relevant for social scientists, they are still widely considered to be "black box" models suited only for prediction, not substantive research. We argue that this need not be the case, and present one method, Random Forests, with an emphasis on its practical application for exploratory analysis and substantive interpretation. Random Forests detect interaction and nonlinearity without…
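The exploratory use the abstract describes can be sketched with scikit-learn (an illustrative stand-in for the R packages the paper actually uses; the data and settings below are assumptions, not the paper's):

```python
# Sketch: a random forest ranks predictors without any parametric
# specification, even when the data contain nonlinearity and interactions.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

# Friedman #1 data: y depends nonlinearly on features 0-4 (including an
# interaction between features 0 and 1); features 5-7 are pure noise.
X, y = make_friedman1(n_samples=500, n_features=8, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)

# Impurity-based importances: the informative features should rank
# above the noise features.
for i, imp in enumerate(rf.feature_importances_):
    print(f"x{i}: {imp:.3f}")
```

No functional form is specified anywhere; the forest recovers the relevant predictors from the data alone, which is the sense in which the paper treats it as an exploratory tool rather than a black box.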


edarf: Exploratory Data Analysis using Random Forests
This package contains functions useful for exploratory data analysis using random forests, which can be fit using the randomForest, randomForestSRC, or party packages (Liaw and Wiener 2002; Ishwaran
Tree-based machine learning methods for survey research
An introduction to prominent tree-based machine learning methods is provided and the usage of these techniques in the context of modeling and predicting nonresponse in panel surveys is exemplified.
Big Data Analytics for Long-Term Meteorological Observations at Hanford Site
This work offers a practical guide to analyzing long-term meteorological data and highlights the integration of ML and classical statistics to applied risk and decision science.
Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data
This article compares the performance of Random Forests with three versions of logistic regression, and finds that the algorithmic approach provides significantly more accurate predictions of civil war onset in out-of-sample data than any of the logistic regression models.
This paper shows how to perform EDA using Python and R programming, and also considers tools that simplify the analysis.
Modelling the impact of the urban spatial structure on the choice of residential location using ‘big earth data’ and machine learning
This study presents an exploratory data analysis approach to study physical characteristics in different living environments based on a large number of variables derived from spatial data such as satellites, OpenStreetMap and statistical data.
A combined approach for analysing heuristic algorithms
Two approaches for analysing algorithm parameters and components (functional analysis of variance and multilevel regression analysis) are considered, together with the benefits of using them jointly, and a combined methodology is presented that yields more insights than either approach used separately.
Random forests as cumulative effects models: A case study of lakes and rivers in Muskoka, Canada.
Machine Learning for Solar Accessibility: Implications for Low-Income Solar Expansion and Profitability
This work uses electricity utility repayment as a proxy for solar installation repayment, and finds that shifting from a FICO score cutoff to the machine learning model increases profits by 34% to 1882% depending on the stringency used for evaluating potential customers.


Understanding Random Forests: From Theory to Practice
The goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability.
An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high-dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application.
Conditional variable importance for random forests
A new, conditional permutation scheme is developed for the computation of the variable importance measure that reflects the true impact of each predictor variable more reliably than the original marginal approach.
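The "original marginal approach" that this conditional scheme refines can be sketched with scikit-learn's `permutation_importance` (an illustrative substitute; the conditional variant itself is implemented in the R party ecosystem, and the data here are simulated):

```python
# Sketch: marginal permutation importance. Each column of held-out data
# is shuffled and the resulting drop in R^2 is recorded. With correlated
# predictors these marginal scores can be misleading, which motivates
# the conditional permutation scheme.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in columns 0 and 1.
X, y = make_regression(n_samples=400, n_features=5, n_informative=2,
                       shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i, m in enumerate(result.importances_mean):
    print(f"x{i}: {m:.3f}")
```

Permuting an informative column destroys its association with the response and the score drops sharply; permuting a noise column changes little.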
Bias in random forest variable importance measures: Illustrations, sources and a solution
An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
Ensemble Trees and CLTs: Statistical Inference for Supervised Learning
This paper develops formal statistical inference procedures for machine learning ensemble methods by considering predictors formed by averaging over trees built on subsamples of the training set, demonstrating that the resulting estimator takes the form of a U-statistic.
Unbiased Recursive Partitioning: A Conditional Inference Framework
A unified framework for recursive partitioning is proposed which embeds tree-structured regression models into a well-defined theory of conditional inference procedures, and it is shown that the prediction accuracy of trees with early stopping is equivalent to the prediction accuracy of pruned trees with unbiased variable selection.
Kernel Regularized Least Squares: Reducing Misspecification Bias with a Flexible and Interpretable Machine Learning Approach
It is argued that the KRLS method is well-suited for social science inquiry because it avoids strong parametric assumptions, yet allows interpretation in ways analogous to generalized linear models while also permitting more complex interpretation to examine nonlinearities, interactions, and heterogeneous effects.
Beyond linearity by default: Generalized additive models
Social scientists almost always use statistical models positing the dependent variable as a global, linear function of X, despite suspicions that the social and political world is not so simple…
Discovering additive structure in black box functions
This paper presents a method that seeks not to display the behavior of a function, but to evaluate the importance of non-additive interactions within any set of variables, and displays the output as a graphical model of the function for interpretation purposes.
Do we need hundreds of classifiers to solve real world classification problems?
The random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks, and boosting ensembles (5 and 3 members in the top 20, respectively).