Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning
@article{Mentch2021BridgingBB, title={Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning}, author={Lucas Mentch and Giles Hooker}, journal={ArXiv}, year={2021}, volume={abs/2102.12328} }
In 2001, Leo Breiman wrote of a divide between “data modeling” and “algorithmic modeling” cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the “data modelers” incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman’s own Random Forest methods. While this can be…
2 Citations
A cautionary tale on fitting decision trees to data from additive models: generalization lower bounds
- Computer Science
- AISTATS
- 2022
A sharp squared error generalization lower bound is proved for a large class of decision tree algorithms fitted to sparse additive models with C^1 component functions, and a novel connection between decision tree estimation and rate-distortion theory, a sub-field of information theory, is established.
Hierarchical Shrinkage: improving the accuracy and interpretability of tree-based methods
- Computer Science
- ArXiv
- 2022
Hierarchical Shrinkage (HS), a post-hoc algorithm that does not modify the tree structure, and instead regularizes the tree by shrinking the prediction over each node towards the sample means of its ancestors, is introduced.
References
Generalized random forests
- Computer Science, Mathematics
- The Annals of Statistics
- 2019
A flexible, computationally efficient algorithm for growing generalized random forests, an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest, and an estimator for their asymptotic variance that enables valid confidence intervals are proposed.
Modeling Avian Full Annual Cycle Distribution and Population Trends with Citizen Science Data
- Environmental Science
- bioRxiv
- 2019
An analytical framework is presented to address challenges and generate year-round, range-wide distributional information using citizen science data and is the first example of an analysis to capture intra‐ and inter-annual distributional dynamics across the entire range of a broadly distributed, highly mobile species.
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
- Mathematics, Computer Science
- Journal of the American Statistical Association
- 2018
This is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference and is found to be substantially more powerful than classical methods based on nearest-neighbor matching.
Analysis of a Random Forests Model
- Computer Science
- J. Mach. Learn. Res.
- 2012
An in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm, shows in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.
BART: Bayesian Additive Regression Trees
- Computer Science
- 2010
We develop a Bayesian "sum-of-trees" model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian…
Random Forests
- Computer Science
- Machine Learning
- 2001
Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Greedy function approximation: A gradient boosting machine.
- Computer Science
- 2001
A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests
- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2016
This work develops formal statistical inference procedures for predictions generated by supervised learning ensembles by considering predictions formed by averaging over trees built on subsamples of the training set and demonstrating that the resulting estimator takes the form of a U-statistic.
Generalized Functional ANOVA Diagnostics for High-Dimensional Functions of Dependent Variables
- Computer Science
- 2007
A weighted functional ANOVA that controls for the effect of dependence between input variables is proposed and demonstrated in the context of machine learning, in which the possibility of poor extrapolation makes it important to restrict attention to regions of high data density.
A Unified Framework for Random Forest Prediction Error Estimation
- Computer Science
- J. Mach. Learn. Res.
- 2021
A unified framework for random forest prediction error estimation based on a novel estimator of the conditional prediction error distribution function is introduced, and it is shown via simulations that the proposed prediction intervals are competitive with, and in some settings outperform, existing methods.