Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning

@article{Mentch2021BridgingBB,
  title={Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning},
  author={Lucas Mentch and Giles Hooker},
  journal={arXiv preprint arXiv:2102.12328},
  year={2021}
}
In 2001, Leo Breiman wrote of a divide between “data modeling” and “algorithmic modeling” cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the “data modelers” incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman’s own Random Forest methods. While this can be… 
A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds
TLDR
A sharp squared-error generalization lower bound is proved for a large class of decision tree algorithms fitted to sparse additive models with C component functions, and a novel connection is established between decision tree estimation and rate-distortion theory, a subfield of information theory.
Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods
TLDR
Hierarchical Shrinkage (HS) is introduced: a post-hoc algorithm that leaves the tree structure unchanged and instead regularizes the tree by shrinking the prediction at each node toward the sample means of its ancestors.
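The shrinkage rule in the HS summary can be sketched for a single root-to-leaf path: the leaf prediction is written as a telescoping sum of mean differences along the path, with each increment damped by how much data the parent node saw. The function name and the exact damping factor `1 + lam / n_parent` are taken from the HS paper's formulation as I understand it, so treat this as an illustrative sketch rather than the reference implementation.

```python
def hs_prediction(path_means, path_counts, lam):
    """Hierarchical-Shrinkage-style prediction for one root-to-leaf path.

    path_means[l]  -- sample mean of the response at the depth-l node on the path
    path_counts[l] -- number of training samples in that node
    lam            -- shrinkage strength (lam = 0 recovers the plain leaf mean)

    The prediction is a telescoping sum of mean differences along the path;
    each increment is damped by 1 + lam / n_parent, so splits made on few
    samples are shrunk hardest toward their ancestors.
    """
    pred = path_means[0]  # root mean
    for l in range(1, len(path_means)):
        pred += (path_means[l] - path_means[l - 1]) / (1.0 + lam / path_counts[l - 1])
    return pred
```

With `lam=0` the increments pass through undamped and the telescoping sum collapses to the leaf mean, e.g. `hs_prediction([2.0, 3.0, 5.0], [100, 40, 10], 0.0)` returns `5.0`; larger `lam` pulls the prediction back toward the root mean.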

References

Showing 1–10 of 79 references
Generalized random forests
TLDR
A flexible, computationally efficient algorithm for growing generalized random forests, an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest, and an estimator for their asymptotic variance that enables valid confidence intervals are proposed.
Modeling Avian Full Annual Cycle Distribution and Population Trends with Citizen Science Data
TLDR
An analytical framework is presented to address these challenges and generate year-round, range-wide distributional information using citizen science data; it is the first example of an analysis that captures intra- and inter-annual distributional dynamics across the entire range of a broadly distributed, highly mobile species.
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
  • Stefan Wager, S. Athey
  • Mathematics, Computer Science
    Journal of the American Statistical Association
  • 2018
TLDR
This is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference and is found to be substantially more powerful than classical methods based on nearest-neighbor matching.
Analysis of a Random Forests Model
  • G. Biau
  • Computer Science
    J. Mach. Learn. Res.
  • 2012
TLDR
An in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm, and shows in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.
BART: Bayesian Additive Regression Trees
We develop a Bayesian "sum-of-trees" model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior.
Random Forests
TLDR
Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and they also apply to regression.
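The "internal estimates" in the Random Forests summary refer to out-of-bag (OOB) error: each bootstrap draw leaves out roughly 37% of the training points, which serve as a built-in test set for that ensemble member. A minimal sketch, using a 1-nearest-neighbor rule on 1-D inputs as a stand-in for a fully grown tree (the function name and the 1-NN substitution are illustrative assumptions, not Breiman's implementation):

```python
import numpy as np

def oob_error(x, y, n_trees=200, seed=0):
    """Out-of-bag classification error for a bagged ensemble.

    A 1-nearest-neighbor rule stands in for each tree: every ensemble
    member is fit on a bootstrap sample, and each point votes only in
    the trees whose bootstrap sample excluded it (its OOB trees)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    votes = np.zeros((n, int(y.max()) + 1))
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)                    # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)          # left-out points
        for i in oob:
            j = idx[np.argmin(np.abs(x[idx] - x[i]))]  # nearest in-bag point
            votes[i, y[j]] += 1
    return np.mean(votes.argmax(axis=1) != y)
```

Because every point is OOB for a fair fraction of the ensemble, this yields an honest error estimate without a held-out set, which is what lets the internal estimates monitor error as the forest grows.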
Greedy function approximation: A gradient boosting machine.
TLDR
A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
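The least-squares case of the gradient boosting paradigm described above is the simplest to write down: the negative gradient of squared loss is just the current residual, so each round fits a small base learner to the residuals and takes a damped step. A minimal sketch with decision stumps as base learners (function names and the stump choice are my own, for illustration):

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump on residuals r (squared error)."""
    best = (np.inf, None, r.mean(), r.mean())
    for s in np.unique(x)[:-1]:                # candidate split points
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    return best[1:]

def gradient_boost(x, y, n_rounds=100, lr=0.1):
    """Least-squares gradient boosting: each round fits a stump to the
    current residuals (the negative gradient of squared loss) and takes
    a small step of size lr in that direction."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        s, left_mean, right_mean = fit_stump(x, y - pred)
        pred = pred + lr * np.where(x <= s, left_mean, right_mean)
    return pred
```

Swapping the residual for the gradient of least-absolute-deviation, Huber, or multiclass logistic loss recovers the other algorithms the paper presents, which is exactly the generality the "any fitting criterion" claim refers to.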
Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests
TLDR
This work develops formal statistical inference procedures for predictions generated by supervised learning ensembles: predictions are formed by averaging over trees built on subsamples of the training set, and the resulting estimator is shown to take the form of a U-statistic.
Generalized Functional ANOVA Diagnostics for High-Dimensional Functions of Dependent Variables
TLDR
A weighted functional ANOVA that controls for the effect of dependence between input variables and is demonstrated in the context of machine learning in which the possibility of poor extrapolation makes it important to restrict attention to regions of high data density.
A Unified Framework for Random Forest Prediction Error Estimation
TLDR
A unified framework for random forest prediction error estimation based on a novel estimator of the conditional prediction error distribution function is introduced, and it is shown via simulations that the proposed prediction intervals are competitive with, and in some settings outperform, existing methods.