Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance
@article{Painsky2017CrossValidatedVS,
  title   = {Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance},
  author  = {Amichai Painsky and Saharon Rosset},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2017},
  volume  = {39},
  pages   = {2142-2153}
}
Recursive partitioning methods producing tree-like models are a long-standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods’ inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. We propose a framework to…
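The flaw and the proposed fix are easy to see in miniature: a categorical variable with many categories can achieve a large training-set impurity reduction purely by chance, so scoring candidate splits on held-out data levels the playing field. The sketch below is our own minimal illustration of cross-validated split scoring for a regression tree, not the authors' implementation; the function name, the squared-error criterion, and the fold setup are assumptions.

```python
# Minimal sketch (not the paper's code): score the best binary partition of a
# categorical variable by its *held-out* impurity reduction, so that
# high-cardinality variables cannot win splits by overfitting alone.
import numpy as np
from sklearn.model_selection import KFold

def cv_split_gain(x_cat, y, n_splits=5, seed=0):
    """Cross-validated gain of the best binary partition induced by a
    categorical variable, under a squared-error (regression) criterion."""
    gains = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(y):
        x_tr, y_tr = x_cat[train_idx], y[train_idx]
        x_te, y_te = x_cat[test_idx], y[test_idx]
        cats = np.unique(x_tr)
        if len(cats) < 2:
            continue
        # Order categories by training-fold mean response; for squared error
        # the optimal binary partition respects this ordering.
        order = sorted(cats, key=lambda c: y_tr[x_tr == c].mean())
        best = -np.inf
        for k in range(1, len(order)):
            left_tr = np.isin(x_tr, order[:k])
            mu_l, mu_r = y_tr[left_tr].mean(), y_tr[~left_tr].mean()
            # Evaluate the split fitted on the training fold on held-out data.
            left_te = np.isin(x_te, order[:k])
            sse_split = ((y_te[left_te] - mu_l) ** 2).sum() \
                      + ((y_te[~left_te] - mu_r) ** 2).sum()
            sse_root = ((y_te - y_te.mean()) ** 2).sum()
            best = max(best, sse_root - sse_split)
        gains.append(best)
    return float(np.mean(gains))
```

Scored this way, a noise variable with hundreds of categories no longer beats a genuinely predictive variable, because its apparent training gain evaporates on the held-out fold.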
32 Citations
Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection
- Computer Science, Entropy
- 2022
It is shown that although these gradient-boosting implementations demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in feature importance (FI); by utilizing cross-validated (CV) unbiased base learners, this flaw is fixed at a relatively low computational cost.
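The reported bias is easy to reproduce in a toy setting. The snippet below is our own hedged illustration (not the paper's experiments): an uninformative feature with many distinct values captures a share of impurity-based importance in a boosted ensemble.

```python
# Toy illustration of impurity-based feature importance bias in boosted trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
x_signal = rng.normal(size=n)            # truly predictive feature
x_noise = rng.integers(0, 500, size=n)   # 500-level feature, independent of y
y = x_signal + rng.normal(scale=0.5, size=n)

X = np.column_stack([x_signal, x_noise.astype(float)])
model = GradientBoostingRegressor(max_depth=3, random_state=0).fit(X, y)
print(dict(zip(["signal", "noise"], model.feature_importances_)))
# The noise feature typically receives a non-trivial share of importance,
# despite being independent of the response.
```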
Decision tree underfitting in mining of gene expression data. An evolutionary multi-test tree approach
- Computer Science, Expert Syst. Appl.
- 2019
Trees-Based Models for Correlated Data
- Computer Science
- 2021
This paper presents a new approach for tree-based regression (simple regression trees, random forests, and gradient boosting) in settings involving correlated data, which explicitly takes the correlation structure into account in the splitting criterion, the stopping rules, and the fitted values in the leaves.
Lossless Compression of Random Forests
- Computer Science, Journal of Computer Science and Technology
- 2019
This work introduces a novel method for lossless compression of tree-based ensemble methods, focusing on random forests, based on probabilistic modeling of the ensemble’s trees, followed by model clustering via Bregman divergence.
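For context, the clustering step relies on the standard Bregman divergence; the definition below uses our notation, not necessarily the paper's.

```latex
% Bregman divergence generated by a strictly convex, differentiable \varphi:
\[
  D_{\varphi}(x, y) \;=\; \varphi(x) - \varphi(y)
    - \langle \nabla\varphi(y),\, x - y \rangle .
\]
% \varphi(x) = \lVert x \rVert_2^2 recovers squared Euclidean distance; on the
% probability simplex, \varphi(x) = \sum_i x_i \log x_i recovers KL divergence.
```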
Variable Selection for Discrimination between Low and High Yielding Populations of Indian Mustard
- Biology, International Journal of Pure & Applied Bioscience
- 2019
Three variable selection methods (the univariate t-test, Rao's F-test for additional information, and the random forests algorithm) were used for classification and discrimination, and their performance was compared and assessed.
LIMITED METHOD FOR THE CASE OF ALGORITHMIC CLASSIFICATION TREE
- Computer Science, Radio Electronics, Computer Science, Control
- 2020
The experiments carried out in this work demonstrate the performance of the proposed software and its promise for a wide spectrum of applied recognition and classification problems.
Model-based imputation of sound level data at thoroughfare using computational intelligence
- Computer Science, Open Engineering
- 2021
A methodology is presented for imputing missing sound-level data, spanning several months, at many noise monitoring stations located along thoroughfares, by applying a single model that describes the variability of the sound level within the tested period.
A Random Forest Model Building Using A priori Information for Diagnosis
- Computer Science, CMIS
- 2019
The problem of inductive model building from precedents for biomedical applications is considered; the proposed random forest construction method yields a more accurate model while preserving the general random character of the method.
THE METHOD OF BOUNDED CONSTRUCTIONS OF LOGICAL CLASSIFICATION TREES IN THE PROBLEM OF DISCRETE OBJECTS CLASSIFICATION
- Computer Science, Ukrainian Journal of Information Technology
- 2021
An effective scheme for recognizing discrete objects is developed, built on step-by-step evaluation and selection of attribute sets along selected paths in the classification tree structure at each stage of scheme synthesis.
On the Universality of the Logistic Loss Function
- Computer Science, 2018 IEEE International Symposium on Information Theory (ISIT)
- 2018
This work shows that for binary classification problems, the divergence associated with smooth, proper and convex loss functions is bounded from above by the Kullback-Leibler (KL) divergence, up to a multiplicative normalization constant.
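In symbols, the statement reads as follows; this is our paraphrase, and the form of the constant is an assumption.

```latex
% For a smooth, proper, convex binary loss \ell with associated divergence
% D_\ell, there exists a constant C_\ell > 0 such that
\[
  D_\ell(p \,\|\, q) \;\le\; C_\ell \, D_{\mathrm{KL}}(p \,\|\, q)
  \qquad \text{for all } p, q \in (0, 1),
\]
% where the binary Kullback-Leibler divergence is
\[
  D_{\mathrm{KL}}(p \,\|\, q) = p \log\frac{p}{q}
    + (1 - p) \log\frac{1 - p}{1 - q}.
\]
% The logistic loss is the canonical case: its associated divergence is the
% binary KL divergence itself.
```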
References
Unbiased Recursive Partitioning: A Conditional Inference Framework
- Computer Science
- 2006
A unified framework for recursive partitioning is proposed which embeds tree-structured regression models into a well-defined theory of conditional inference procedures, and it is shown that the prediction accuracy of trees with early stopping is equivalent to the prediction accuracy of pruned trees with unbiased variable selection.
SPLIT SELECTION METHODS FOR CLASSIFICATION TREES
- Computer Science
- 1997
This article presents an algorithm called QUEST that has negligible selection bias; it shares similarities with the FACT method but yields binary splits, and the final tree can be selected by a direct stopping rule or by pruning.
REGRESSION TREES WITH UNBIASED VARIABLE SELECTION AND INTERACTION DETECTION
- Computer Science
- 2002
The proposed algorithm, GUIDE, is specifically designed to eliminate variable selection bias, a problem that can undermine the reliability of inferences from a tree structure; it also offers fast computation, natural extension to data sets with categorical variables, and direct detection of local two-variable interactions.
Selecting multiway splits in decision trees
- Computer Science
- 1996
A new criterion for model selection, a resampling estimate of the information gain, is introduced; it generates multiway trees that are both smaller and more accurate than those produced previously, with performance comparable to standard binary decision trees.
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
- Computer Science, IJCAI
- 1995
The results indicate that for real-world datasets similar to the authors', the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
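As a short sketch of that recommendation in practice (scikit-learn; the dataset and model choices here are ours):

```python
# Ten-fold stratified cross-validation for model assessment.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(f"10-fold stratified CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```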
Classification and regression trees
- Computer Science, WIREs Data Mining Knowl. Discov.
- 2011
This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
Fifty Years of Classification and Regression Trees
- Computer Science
- 2014
This article surveys the developments and briefly reviews the key ideas behind some of the major regression tree algorithms.
Using a Permutation Test for Attribute Selection in Decision Trees
- Computer Science, Mathematics, ICML
- 1998
This work describes how permutation tests can be applied to the problem of attribute selection in decision trees and gives a novel two-stage method for doing so.
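The idea can be sketched compactly; the code below is our own illustration of a permutation p-value for an attribute's information gain, not the paper's exact two-stage procedure.

```python
# Permutation test for a candidate split attribute: how often does a random
# relabeling of the attribute achieve at least the observed information gain?
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(x, y):
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

def permutation_p_value(x, y, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    observed = info_gain(x, y)
    null = [info_gain(rng.permutation(x), y) for _ in range(n_perm)]
    # Add-one correction gives a valid p-value for a finite permutation sample.
    return (1 + sum(g >= observed for g in null)) / (n_perm + 1)
```

A large gain that is nonetheless common under permutation (large p-value) signals a variable rewarded for its cardinality rather than its information.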
Tree-Structured Classification via Generalized Discriminant Analysis.
- Computer Science
- 1988
A new method of tree-structured classification is obtained by recursive application of linear discriminant analysis, with the variables at each stage being appropriately chosen according to the data and the type of splits desired.
Bias in random forest variable importance measures: Illustrations, sources and a solution
- Computer Science, BMC Bioinformatics
- 2006
An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
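The paper's remedy is an alternative forest construction; a related and widely used safeguard, shown below as a hedged sketch in scikit-learn rather than the authors' R-based method, is to measure importance by permutation on held-out data.

```python
# Permutation importance on held-out data: unlike impurity-based importance,
# it does not systematically favor variables with many categories or cutpoints.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
print(result.importances_mean)  # importance measured on unseen data
```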