Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

@article{Painsky2017CrossValidatedVS,
  title={Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance},
  author={Amichai Painsky and Saharon Rosset},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2017},
  volume={39},
  pages={2142-2153}
}
  • Amichai Painsky, Saharon Rosset
  • Published 10 December 2015
  • Computer Science
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
Recursive partitioning methods producing tree-like models are a long-standing staple of predictive modeling. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods’ inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. We propose a framework to… 
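
The core idea can be illustrated with a minimal sketch (a hypothetical Python helper, not the authors' implementation): the best split of a categorical variable is chosen on a training fold and then scored on a held-out fold, so a high-cardinality noise variable can no longer win merely by overfitting the sample it was selected on.

# Minimal sketch of cross-validated split scoring for a categorical
# variable; cv_split_score is a hypothetical helper, not the paper's code.
import numpy as np
from sklearn.model_selection import KFold

def cv_split_score(x_cat, y, n_splits=5, seed=0):
    """Held-out reduction in squared error from the best binary split on x_cat."""
    scores = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(y):
        # Choose the split on the training fold: order categories by mean
        # response and scan the cut points (the standard CART device).
        cats = np.unique(x_cat[tr])
        order = sorted(cats, key=lambda c: y[tr][x_cat[tr] == c].mean())
        best_sse, best_left = np.inf, set()
        for k in range(1, len(order)):
            left = set(order[:k])
            m = np.isin(x_cat[tr], list(left))
            sse = ((y[tr][m] - y[tr][m].mean()) ** 2).sum() + \
                  ((y[tr][~m] - y[tr][~m].mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best_left = sse, left
        if not best_left:          # only one category seen: no split possible
            scores.append(0.0)
            continue
        # Score the chosen split on the held-out fold, predicting with the
        # training-fold child means (unseen categories fall to the right).
        m_tr = np.isin(x_cat[tr], list(best_left))
        mu_l, mu_r = y[tr][m_tr].mean(), y[tr][~m_tr].mean()
        pred = np.where(np.isin(x_cat[te], list(best_left)), mu_l, mu_r)
        scores.append(((y[te] - y[tr].mean()) ** 2).sum()
                      - ((y[te] - pred) ** 2).sum())
    return float(np.mean(scores))

rng = np.random.default_rng(0)
y = rng.normal(size=500)
noise = rng.integers(0, 50, 500)                     # pure noise, 50 categories
signal = (y + rng.normal(size=500) > 0).astype(int)  # informative, 2 categories
print(cv_split_score(noise, y))   # near zero or negative: no signal survives
print(cv_split_score(signal, y))  # clearly positive: real signal survives CV
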

Citations

Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection
TLDR
It is shown that although modern gradient boosting implementations demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in feature importance (FI), and that by utilizing cross-validated (CV) unbiased base learners this flaw can be fixed at a relatively low computational cost.
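
As a hedged illustration of the bias being described (my own sketch in scikit-learn, not the cited paper's experiment or its CV base learners): impurity-based importance in a boosted ensemble gives noticeable credit to a pure-noise, high-cardinality feature, while importance measured on held-out data does not.

# Sketch (not the cited paper's code): impurity-based importance in a
# boosted ensemble gives credit to a pure-noise high-cardinality feature,
# while permutation importance on held-out data puts it near zero.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
signal = rng.integers(0, 2, n)        # informative binary feature
noise = rng.integers(0, 100, n)       # label-encoded 100-level noise
y = (signal + rng.normal(0, 1, n) > 0.5).astype(int)
X = np.column_stack([signal, noise]).astype(float)

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
gbm = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)

print("impurity FI:", gbm.feature_importances_)    # noise earns real credit
pi = permutation_importance(gbm, Xte, yte, random_state=0)
print("held-out PI:", pi.importances_mean)         # noise drops to ~0
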
Trees-Based Models for Correlated Data
TLDR
This paper presents a new approach for tree-based regression (simple regression trees, random forests, and gradient boosting) in settings involving correlated data, which explicitly takes the correlation structure into account in the splitting criterion, stopping rules, and fitted values in the leaves.
Lossless Compression of Random Forests
TLDR
This work introduces a novel method for lossless compression of tree-based ensemble methods, focusing on random forests, based on probabilistic modeling of the ensemble’s trees, followed by model clustering via Bregman divergence.
Variable Selection for Discrimination between Low and High Yielding Populations of Indian Mustard
  • P. Godara
  • Biology
    International Journal of Pure & Applied Bioscience
  • 2019
TLDR
Three variable selection methods (univariate t-test, Rao's F-test for additional information, and the random forests algorithm) for classification and discrimination were used and compared, and the performance of the methods was assessed.
LIMITED METHOD FOR THE CASE OF ALGORITHMIC CLASSIFICATION TREE
  • I. Povhan
  • Computer Science
    Radio Electronics, Computer Science, Control
  • 2020
TLDR
The experiments carried out in this work demonstrate the performance of the proposed software and its promise for a wide spectrum of applied recognition and classification problems.
Model-based imputation of sound level data at thoroughfare using computational intelligence
  • M. Kekez
  • Computer Science
    Open Engineering
  • 2021
TLDR
A methodology is presented for imputing several months of missing sound level data across many noise monitoring stations located at thoroughfares, using a single model that describes the variability of the sound level within the tested period.
A Random Forest Model Building Using A priori Information for Diagnosis
TLDR
The problem of inductive model building from precedents for biomedical applications is considered; the proposed method of random forest construction provides a more accurate model while preserving the generally random character of the method.
THE METHOD OF BOUNDED CONSTRUCTIONS OF LOGICAL CLASSIFICATION TREES IN THE PROBLEM OF DISCRETE OBJECTS CLASSIFICATION
  • I. Povkhan
  • Computer Science
    Ukrainian Journal of Information Technology
  • 2021
TLDR
An effective scheme for recognizing discrete objects has been developed, built on step-by-step evaluation and selection of attribute sets along selected paths in the classification tree structure at each stage of scheme synthesis.
On the Universality of the Logistic Loss Function
TLDR
This work shows that for binary classification problems, the divergence associated with smooth, proper and convex loss functions is bounded from above by the Kullback-Leibler (KL) divergence, up to a multiplicative normalization constant.
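
In symbols (my paraphrase of the summary above; $C_\ell$ is the loss-dependent normalization constant and $D_\ell$ the divergence induced by the loss $\ell$):

    D_\ell(p \,\|\, q) \;\le\; C_\ell \, D_{\mathrm{KL}}(p \,\|\, q)

for every smooth, proper, convex binary loss $\ell$. Since the divergence induced by the logistic loss is the KL divergence itself, this is one way to read the universality in the title.
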

References

Showing 1-10 of 34 references
Unbiased Recursive Partitioning: A Conditional Inference Framework
TLDR
A unified framework for recursive partitioning is proposed which embeds tree-structured regression models into a well-defined theory of conditional inference procedures, and it is shown that the prediction accuracy of trees with early stopping is equivalent to the prediction accuracy of pruned trees with unbiased variable selection.
SPLIT SELECTION METHODS FOR CLASSIFICATION TREES
TLDR
This article presents an algorithm called QUEST that has negligible variable selection bias; it shares similarities with the FACT method but yields binary splits, and the final tree can be selected by a direct stopping rule or by pruning.
REGRESSION TREES WITH UNBIASED VARIABLE SELECTION AND INTERACTION DETECTION
TLDR
The proposed algorithm, GUIDE, is specifically designed to eliminate variable selection bias, a problem that can undermine the reliability of inferences from a tree structure, and it allows fast computation, natural extension to data sets with categorical variables, and direct detection of local two-variable interactions.
Selecting multiway splits in decision trees
TLDR
A new criterion for model selection, a resampling estimate of the information gain, is introduced; it generates multiway trees that are both smaller and more accurate than those produced previously, with performance comparable to standard binary decision trees.
A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
TLDR
The results indicate that for real-world datasets similar to the authors', the best method to use for model selection is ten-fold stratified cross-validation, even if computation power allows using more folds.
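
For concreteness, a minimal example of that protocol (ten-fold stratified cross-validation) in scikit-learn; the dataset and the tree model here are placeholders, not the study's benchmarks:

# Minimal example of ten-fold stratified cross-validation for model
# selection; dataset and model are placeholders for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for depth in (2, 4, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"max_depth={depth}: {scores.mean():.3f} +/- {scores.std():.3f}")
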
Classification and regression trees
  • W. Loh
  • Computer Science
    WIREs Data Mining Knowl. Discov.
  • 2011
TLDR
This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
Fifty Years of Classification and Regression Trees
TLDR
This article surveys these developments and briefly reviews the key ideas behind some of the major regression tree algorithms.
Using a Permutation Test for Attribute Selection in Decision Trees
TLDR
This work describes how permutation tests can be applied to the problem of attribute selection in decision trees, and gives a novel two-stage method for applying them.
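
A minimal sketch of the idea (hypothetical helpers, not the paper's two-stage algorithm): the observed split criterion for an attribute is compared against its distribution under random permutations of the labels, giving a p-value that is comparable across attributes with different numbers of values, unlike the raw gain.

# Sketch of permutation-test attribute selection (illustrative only):
# compare the observed impurity reduction to its null distribution
# obtained by permuting the class labels.
import numpy as np

def gini_gain(x_cat, y):
    """Gini impurity reduction from a multiway split on x_cat (binary y)."""
    def gini(labels):
        p = np.bincount(labels, minlength=2) / len(labels)
        return 1.0 - (p ** 2).sum()
    child = sum((x_cat == c).mean() * gini(y[x_cat == c])
                for c in np.unique(x_cat))
    return gini(y) - child

def permutation_pvalue(x_cat, y, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    observed = gini_gain(x_cat, y)
    null = [gini_gain(x_cat, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(g >= observed for g in null)) / (n_perm + 1)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 300)
noise = rng.integers(0, 30, 300)            # 30-level noise attribute
informative = y ^ (rng.random(300) < 0.1)   # y with 10% of labels flipped
print(permutation_pvalue(noise, y))         # large p despite inflated raw gain
print(permutation_pvalue(informative, y))   # tiny p: genuine association
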
Tree-Structured Classification via Generalized Discriminant Analysis.
TLDR
A new method of tree-structured classification is obtained by recursive application of linear discriminant analysis, with the variables at each stage being appropriately chosen according to the data and the type of splits desired.
Bias in random forest variable importance measures: Illustrations, sources and a solution
TLDR
An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
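
The bias itself can be reproduced in a few lines (a sketch using scikit-learn, not the conditional inference implementation proposed in the paper): with every predictor pure noise, impurity-based importance still favors the variables with more distinct values.

# Reproducing the bias (scikit-learn, not the implementation proposed in
# the paper): every predictor is pure noise, yet impurity-based importance
# grows with the number of distinct values a predictor takes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 2, n),     # binary noise
    rng.integers(0, 4, n),     # 4-level noise
    rng.integers(0, 100, n),   # 100-level noise
    rng.normal(size=n),        # continuous noise
]).astype(float)
y = rng.integers(0, 2, n)      # labels independent of all predictors

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)  # importance increases left to right
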