Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning

@article{Mentch2021BridgingBB,
  title={Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning},
  author={Lucas Mentch and Giles Hooker},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.12328}
}
In 2001, Leo Breiman wrote of a divide between “data modeling” and “algorithmic modeling” cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the “data modelers” incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman’s own Random Forest methods. While this can be… 
Scientific Inference With Interpretable Machine Learning: Analyzing Models to Learn About Real-World Phenomena
TLDR
A phenomenon-centric approach to IML in science clarifies the opportunities and limitations of IML for inference, shows that conditional rather than marginal sampling is required, and identifies the conditions under which the authors can trust IML methods.
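The sampling distinction can be sketched as follows (notation ours, not from the paper): for a fitted model f, a feature of interest X_j, and the remaining features X_{-j}, the two schemes differ in the law the replacement value is drawn from:

```latex
% Marginal sampling (e.g., standard permutation importance):
\tilde{x}_j \sim P(X_j), \qquad \text{evaluate } f(\tilde{x}_j,\, x_{-j})
% Conditional sampling (stays on the data manifold):
\tilde{x}_j \sim P(X_j \mid X_{-j} = x_{-j}), \qquad \text{evaluate } f(\tilde{x}_j,\, x_{-j})
```

Marginal draws can place the model in regions with no data support, which is why the paper argues conditional sampling is required for inference about the underlying phenomenon.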
Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods
TLDR
Hierarchical Shrinkage (HS) is introduced: a post-hoc algorithm that does not modify the tree structure but instead regularizes the tree by shrinking the prediction at each node toward the sample means of its ancestors.
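A sketch of the shrinkage rule in our notation (λ is the regularization parameter, N(t) the number of training samples in node t): for a query x that traverses nodes t_0 (root) through t_L (leaf), the HS prediction telescopes the node means and damps each increment:

```latex
\hat{f}_{\lambda}(x) \;=\; \bar{y}_{t_0} \;+\; \sum_{l=1}^{L} \frac{\bar{y}_{t_l} - \bar{y}_{t_{l-1}}}{1 + \lambda / N(t_{l-1})}
```

Setting λ = 0 recovers the ordinary leaf mean; larger λ pulls predictions toward ancestor averages without touching the tree's structure.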
A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds
TLDR
A sharp squared-error generalization lower bound is proved for a large class of decision tree algorithms fitted to sparse additive models with C¹ component functions, and a novel connection is established between decision tree estimation and rate-distortion theory, a sub-field of information theory.
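For concreteness, the function class in question is (in our notation) the sparse additive model with smooth components:

```latex
y \;=\; \sum_{j \in S} f_j(x_j) \;+\; \varepsilon, \qquad |S| \ll d, \quad f_j \in C^1
```

The lower bound applies to decision trees fitted to data generated from this class.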

References

Showing 1-10 of 78 references
Modeling Avian Full Annual Cycle Distribution and Population Trends with Citizen Science Data
TLDR
An analytical framework is presented to address the challenges of citizen science data and to generate year-round, range-wide distributional information; it is the first analysis to capture intra- and inter-annual distributional dynamics across the entire range of a broadly distributed, highly mobile species.
Generalized random forests
TLDR
A flexible, computationally efficient algorithm for growing generalized random forests is proposed, together with an adaptive weighting function, derived from the forest, that expresses heterogeneity in the specified quantity of interest, and an estimator of the asymptotic variance that enables valid confidence intervals.
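The adaptive weighting idea can be summarized as follows (notation follows the spirit of the paper, details simplified): the forest induces similarity weights α_i(x), the frequency with which observation i falls in the same leaf as the query point x, and the target θ(x) is estimated by solving a locally weighted estimating equation:

```latex
\sum_{i=1}^{n} \alpha_i(x)\, \psi_{\hat{\theta}(x)}\!\left(O_i\right) \;=\; 0
```

Different choices of the score function ψ recover quantile regression, instrumental-variables estimation, or heterogeneous-treatment-effect estimation as special cases.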
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
  • Stefan Wager, Susan Athey
  • Mathematics, Computer Science
    Journal of the American Statistical Association
  • 2018
TLDR
This is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference, and the method is found to be substantially more powerful than classical methods based on nearest-neighbor matching.
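The estimand here is the conditional average treatment effect (standard notation, not specific to the paper):

```latex
\tau(x) \;=\; \mathbb{E}\left[\, Y(1) - Y(0) \,\middle|\, X = x \,\right]
```

The paper's key result is pointwise asymptotic normality of the forest estimate of τ(x), which is what makes confidence intervals for heterogeneous effects possible.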
Greedy function approximation: A gradient boosting machine.
TLDR
A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
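A minimal least-squares instance of the paradigm, as a sketch: base learners are assumed to be scikit-learn regression trees, and the stage count, learning rate, and depth below are illustrative choices, not values from the paper.

```python
# Sketch of least-squares gradient boosting: each stage fits a shallow tree
# to the residuals (the negative gradient of squared-error loss).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_ls_boost(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    f0 = y.mean()                      # initial constant model
    trees, pred = [], np.full(len(y), f0)
    for _ in range(n_stages):
        residuals = y - pred           # negative gradient for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_ls_boost(model, X, learning_rate=0.1):
    # learning_rate must match the value used in fitting
    f0, trees = model
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```

Other losses in the paper (least absolute deviation, Huber) change only the pseudo-residuals each stage fits.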
Random Forests
TLDR
Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
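The "internal estimates" are out-of-bag (OOB) statistics: each tree is fit on a bootstrap sample, and the observations left out of that sample provide an internal estimate of generalization error. A minimal sketch using scikit-learn's OOB score on synthetic, illustrative data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# oob_score=True asks the forest to score each observation using only
# the trees whose bootstrap sample excluded it.
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB R^2:", forest.oob_score_)
```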
Analysis of a Random Forests Model
  • Gérard Biau
  • Computer Science
    J. Mach. Learn. Res.
  • 2012
TLDR
An in-depth analysis is given of a random forests model suggested by Breiman (2004), which is very close to the original algorithm; the procedure is shown to be consistent and to adapt to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.
BART: Bayesian Additive Regression Trees
We develop a Bayesian "sum-of-trees" model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior.
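The sum-of-trees model, in the paper's spirit (g(x; T_j, M_j) denotes the step function given by tree structure T_j with leaf parameters M_j):

```latex
y_i \;=\; \sum_{j=1}^{m} g(x_i;\, T_j, M_j) \;+\; \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)
```

The regularization prior keeps each g weak, so no single tree dominates the fit.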
Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests
TLDR
Formal statistical inference procedures are developed for predictions generated by supervised learning ensembles: by considering predictions formed by averaging over trees built on subsamples of the training set, the resulting estimator is shown to take the form of a U-statistic.
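The U-statistic structure arises because each tree is a fixed function (kernel) of a size-s subsample; a sketch in our notation, with T_x the tree prediction at a query point x:

```latex
U_n(x) \;=\; \binom{n}{s}^{-1} \sum_{1 \le i_1 < \cdots < i_s \le n} T_x\!\left(Z_{i_1}, \ldots, Z_{i_s}\right)
```

In practice only B random subsamples are used (an incomplete U-statistic), but classical U-statistic asymptotics still deliver the normal limit underlying the confidence intervals and hypothesis tests.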
Generalized Functional ANOVA Diagnostics for High-Dimensional Functions of Dependent Variables
TLDR
A weighted functional ANOVA is proposed that controls for the effect of dependence between input variables; it is demonstrated in the context of machine learning, where the possibility of poor extrapolation makes it important to restrict attention to regions of high data density.
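A sketch of the weighted construction (our paraphrase): instead of the uniform measure used by classical functional ANOVA, the components {f_u} are defined to minimize a weighted L2 criterion under hierarchical orthogonality constraints, with the weight w(x) tracking data density:

```latex
\{f_u\} \;=\; \arg\min \int \Big( F(x) - \sum_{u} f_u(x_u) \Big)^{2} w(x)\, dx
```

Taking w to be (an estimate of) the input density confines the decomposition to regions of high data density, avoiding extrapolation artifacts.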
Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee
TLDR
The empirical results demonstrate that this variational procedure provides uncertainty quantification in terms of the Bayesian predictive distribution and is also capable of accomplishing consistent variable selection by training a sparse multi-layer neural network.
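A typical spike-and-slab variational family for this setting (a sketch of the general idea, not necessarily the paper's exact parameterization), applied to each network weight θ_j:

```latex
q(\theta_j) \;=\; \gamma_j\, N(\mu_j, \sigma_j^2) \;+\; (1 - \gamma_j)\, \delta_0
```

The inclusion probabilities γ_j drive variable (weight) selection, while the Gaussian slab carries the predictive uncertainty.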
...