Corpus ID: 234742642

Cross-Cluster Weighted Forests

Maya Ramchandran, Rajarshi Mukherjee, Giovanni Parmigiani
Adapting machine learning algorithms to better handle the presence of natural clustering or batch effects within training datasets is imperative across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a single dataset with heterogeneity in the distribution of the features. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant… 
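The core construction described in the abstract can be sketched as: partition the training data with k-means, fit a Random Forest on each cluster, and combine the forests' predictions at test time. The sketch below uses scikit-learn; the function names and the default equal weighting are illustrative assumptions, not the authors' exact weighting scheme.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

def fit_cluster_forests(X, y, n_clusters=3, seed=0):
    """Cluster the training set with k-means, then fit one forest per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    forests = []
    for c in range(n_clusters):
        mask = km.labels_ == c
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X[mask], y[mask])
        forests.append(rf)
    return forests

def predict_ensemble(forests, X_new, weights=None):
    """Weighted average of the per-cluster forests' predictions."""
    preds = np.stack([rf.predict(X_new) for rf in forests])  # shape (k, n)
    if weights is None:
        weights = np.full(len(forests), 1.0 / len(forests))  # equal weights
    return weights @ preds
```

A learned, replicability-rewarding weight vector (as in the paper's weighting schemes) can be passed via `weights` in place of the uniform default.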



Tree-Weighting for Multi-Study Ensemble Learners
This work considers novel weighting approaches for constructing tree-based ensemble learners in this setting that reward cross-study replicability within the training set, and finds that incorporating multiple layers of ensembling in the training process increases the robustness of the resulting predictor.
On Ensembling vs Merging: Least Squares and Random Forests under Covariate Shift
It has been postulated and observed in practice that for prediction problems in which covariate data can be naturally partitioned into clusters, ensembling algorithms based on suitably aggregating…
Cluster Forests
Inspired by Random Forests (RF) in the context of classification, we propose a new clustering ensemble method—Cluster Forests (CF). Geometrically, CF randomly probes a high-dimensional data cloud to…
Unsupervised Learning With Random Forest Predictors
A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can…
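A common way to compute the RF dissimilarity mentioned above: the proximity of two observations is the fraction of trees in which they fall into the same leaf, and dissimilarity is one minus proximity. The sketch below assumes an already-fitted scikit-learn forest; the original unsupervised method fits the forest to distinguish real from synthetic data, a step omitted here for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(forest, X):
    """1 - proximity, where proximity = fraction of trees sharing a leaf."""
    leaves = forest.apply(X)              # (n_samples, n_trees) leaf indices
    n, n_trees = leaves.shape
    prox = np.zeros((n, n))
    for t in range(n_trees):
        # pairwise indicator: same leaf in tree t
        same = leaves[:, t][:, None] == leaves[None, :, t]
        prox += same
    prox /= n_trees
    return 1.0 - prox
```

The resulting matrix is symmetric with a zero diagonal, so it can be fed directly to distance-based clustering routines such as hierarchical clustering or PAM.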
K-Random Forests: a K-means style algorithm for Random Forest clustering
  • M. Bicego
  • Computer Science
    2019 International Joint Conference on Neural Networks (IJCNN)
  • 2019
The proposed scheme, which is called K-Random Forests (K-RF), has been evaluated on five datasets and suggests that it represents a valid alternative to classic Random Forest clustering algorithms as well as to other established clustering approaches.
Empirical characterization of random forest variable importance measures
The RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables.
The Utility of Clustering in Prediction Tasks
The direct utility of using clustering to improve prediction accuracy is investigated; the authors find that this method improves upon the predictions of even a Random Forest, suggesting that clustering provides a novel and useful source of variance in the prediction process.
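One simple instantiation of "clustering to improve prediction" is to append each point's distances to the k-means centroids as extra features before fitting the predictor. This is an illustrative sketch under that assumption, not necessarily the exact construction used in the cited paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def add_cluster_features(X, n_clusters=4, seed=0):
    """Augment X with each point's distance to every k-means centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    dists = km.transform(X)               # (n_samples, n_clusters) distances
    return np.hstack([X, dists]), km
```

At test time, the fitted `km` object's `transform` produces the same augmented representation for new observations, keeping train and test features consistent.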
A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data
Of the 120 cases studied, using support vector machines and k-nearest neighbors as classifiers and the Matthews correlation coefficient as the performance metric, Ratio-G, Ratio-A, EJLR, mean-centering, and standardization perform better than or equivalent to no batch-effect removal in 89, 85, 83, 79, and 75% of cases, respectively, suggesting that applying these methods is generally advisable and that ratio-based methods are preferred.
Random Forests
  • L. Breiman
  • Mathematics, Computer Science
    Machine Learning
  • 2001
Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Classification and Regression by randomForest
Random forests, which add an additional layer of randomness to bagging and are robust against overfitting, are proposed; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.