• Corpus ID: 235352609

On Ensembling vs Merging: Least Squares and Random Forests under Covariate Shift

Maya Ramchandran and Rajarshi Mukherjee
It has been postulated and observed in practice that for prediction problems in which covariate data can be naturally partitioned into clusters, ensembling algorithms based on suitably aggregating models trained on individual clusters often perform substantially better than methods that ignore the clustering structure in the data. In this paper, we provide theoretical support to these empirical observations by asymptotically analyzing linear least squares and random forest regressions under a… 
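The contrast the abstract draws between merging and ensembling can be sketched for one-dimensional least squares. This is only an illustration, not the paper's procedure: the helper names (`fit_ols`, `merged_fit`, `ensemble_predict`) are hypothetical, and the equal default weights are an assumption, since the weighting scheme analyzed in the paper is not given in this excerpt.

```python
def fit_ols(xs, ys):
    """Closed-form simple least squares; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def predict(model, x):
    intercept, slope = model
    return intercept + slope * x


def merged_fit(clusters):
    """Merging: pool all clusters and fit a single model."""
    xs = [x for cx, _ in clusters for x in cx]
    ys = [y for _, cy in clusters for y in cy]
    return fit_ols(xs, ys)


def ensemble_predict(clusters, x, weights=None):
    """Ensembling: fit one model per cluster, then average predictions.

    Equal weights are an assumption made for this sketch.
    """
    models = [fit_ols(cx, cy) for cx, cy in clusters]
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return sum(w * predict(m, x) for w, m in zip(weights, models))


# Two clusters with shifted covariates, both lying on y = 1 + 2x (no noise).
clusters = [([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]),
            ([10.0, 11.0, 12.0], [21.0, 23.0, 25.0])]
print(predict(merged_fit(clusters), 5.0))
print(ensemble_predict(clusters, 5.0))
```

With noiseless data from a single linear model, the two approaches agree; they differ once noise or cluster-specific structure enters.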
1 Citation

Cross-Cluster Weighted Forests

It is found that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm.



Sharp Analysis of a Simple Model for Random Forests

A historically important random forest model, in which a feature is selected at random and the split occurs at the midpoint of the node along the chosen feature, is revisited, and it is shown that this rate cannot be improved in general.
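The splitting rule described above (random feature, midpoint split) can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited paper's code: covariates are taken to lie in the unit cube, the function names are hypothetical, and returning 0.0 for an empty cell is a convention chosen here. Because the splits do not depend on the data, the sketch only follows the path of cells containing the query point.

```python
import random


def centered_tree_predict(X, y, x, depth, rng):
    """Predict at x with one tree whose splits ignore the data.

    X is a list of points in [0, 1]^d, y the responses, x the query point.
    """
    d = len(x)
    lo = [0.0] * d
    hi = [1.0] * d
    idx = list(range(len(X)))
    for _ in range(depth):
        j = rng.randrange(d)           # feature chosen uniformly at random
        mid = (lo[j] + hi[j]) / 2.0    # split at the midpoint of the cell
        if x[j] < mid:
            hi[j] = mid
            idx = [i for i in idx if X[i][j] < mid]
        else:
            lo[j] = mid
            idx = [i for i in idx if X[i][j] >= mid]
    if not idx:
        return 0.0  # empty-cell convention (an assumption of this sketch)
    return sum(y[i] for i in idx) / len(idx)


def centered_forest_predict(X, y, x, depth=3, n_trees=50, seed=0):
    """Average the predictions of n_trees independent random trees."""
    rng = random.Random(seed)
    preds = [centered_tree_predict(X, y, x, depth, rng) for _ in range(n_trees)]
    return sum(preds) / n_trees


X = [[0.1], [0.2], [0.8], [0.9]]
y = [1.0, 1.0, 3.0, 3.0]
print(centered_forest_predict(X, y, [0.15], depth=1))
```

In one dimension with `depth=1`, every tree splits at 0.5, so the prediction at 0.15 is the mean response of the two points in the left half.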

Tree-Weighting for Multi-Study Ensemble Learners

Novel weighting approaches for constructing tree-based ensemble learners in this setting are considered that reward cross-study replicability within the training set; incorporating multiple layers of ensembling in the training process is found to increase the robustness of the resulting predictor.

The Utility of Clustering in Prediction Tasks

The direct utility of using clustering to improve prediction accuracy is investigated, and it is found that this method improves upon the prediction of even a Random Forests predictor, which suggests that it provides a novel and useful source of variance in the prediction process.

Analysis of a Random Forests Model

  • G. Biau
  • Computer Science
    J. Mach. Learn. Res.
  • 2012
An in-depth analysis is offered of a random forests model suggested by Breiman (2004), which is very close to the original algorithm; it shows in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.

Classification and Regression by randomForest

Random forests are proposed, which add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.

A framework for simultaneous co-clustering and learning from complex data

A model-based co-clustering (meta)-algorithm that interleaves clustering and construction of prediction models to iteratively improve both cluster assignment and fit of the models is presented.

Prediction models for clustered data: comparison of a random intercept and standard regression model

The models with random intercept discriminate better than the standard model only if the cluster effect is used for predictions, and the prediction model with random intercept had good calibration within clusters.

Surprises in High-Dimensional Ridgeless Least Squares Interpolation

This paper recovers, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.

A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data

Of the 120 cases studied using support vector machines and K-nearest neighbors as classifiers and the Matthews correlation coefficient as the performance metric, it is found that the Ratio-G, Ratio-A, EJLR, mean-centering and standardization methods perform better than or equivalent to no batch effect removal in 89, 85, 83, 79 and 75% of the cases, respectively, suggesting that the application of these methods is generally advisable and that ratio-based methods are preferred.


A heuristic analysis is presented in this paper based on a simplified version of RF denoted RF0 that supports the empirical results from RF and illuminates why RF is able to handle large numbers of input variables and what the role of mtry is.