Boruta - A System for Feature Selection

Authors: Miron Bartosz Kursa, Aleksander Jankowski, Witold R. Rudnicki
Journal: Fundam. Informaticae
Machine learning methods are often used to classify objects described by hundreds of attributes; in many such applications, a large fraction of the attributes may be entirely irrelevant to the classification problem. [...] It is an extension of the random forest method that uses the importance measure generated by the original algorithm. It iteratively compares the importances of the original attributes with the importances of their randomised copies. We analyse performance of the…
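
The comparison of original attributes with their randomised copies can be illustrated with a minimal sketch. This is not the authors' implementation: an absolute Pearson correlation stands in for the random forest importance measure, a fixed hit-rate cutoff stands in for a statistical test, and no attributes are removed between iterations. The names abs_correlation, shadow_selection, n_iter, and min_hit_rate are hypothetical and chosen only for this example.

```python
import random


def abs_correlation(xs, ys):
    """Stand-in importance score (assumption): absolute Pearson correlation."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_selection(columns, target, n_iter=50, min_hit_rate=0.9, seed=0):
    """Keep attributes that beat the best randomised copy in most iterations.

    columns: dict mapping attribute name -> list of numeric values.
    target:  list of numeric class labels (for example 0/1).
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        # Randomised copies: shuffling breaks any link to the target,
        # so their scores show what pure chance can achieve.
        shadow_scores = []
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_scores.append(abs_correlation(shadow, target))
        threshold = max(shadow_scores)
        # An original attribute scores a hit when it beats the best copy.
        for name, values in columns.items():
            if abs_correlation(values, target) > threshold:
                hits[name] += 1
    return [name for name in columns if hits[name] / n_iter >= min_hit_rate]


# Toy data: "signal" tracks the class label, "noise" is uncorrelated with it.
target = [i % 2 for i in range(200)]
columns = {
    "signal": [t + 0.1 * (i % 5) for i, t in enumerate(target)],
    "noise": [(i // 2) % 2 for i in range(200)],
}
selected = shadow_selection(columns, target)
print(selected)
```

Taking the maximum score among the randomised copies as the bar is what makes the comparison data-driven: the threshold adapts to how much apparent importance chance alone produces on the given data set.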

The All Relevant Feature Selection using Random Forest

The procedure is tested using a well-known gene expression data set: the relevance of nearly all previously established important genes is confirmed, and the relevance of several new ones is discovered.

All Relevant Feature Selection Methods and Applications

The problem of all-relevant feature selection is first defined, then key algorithms are described, and the Boruta algorithm is explained in greater detail and applied to a collection of both synthetic and real-world data sets.

Embedded all relevant feature selection with Random Ferns

The idea of incorporating all-relevant selection into the training process by producing importance for implicitly generated shadows (attributes irrelevant by design) is investigated, and a method in the context of the random ferns classifier is proposed and evaluated.

A Deceiving Charm of Feature Selection: The Microarray Case Study

This paper presents a reanalysis of a previously published late radiation toxicity prediction problem and shows how futile it may be to rely on non-validated feature selection and how even advanced algorithms fail to distinguish between noise and signal when the latter is weak.

Generational Feature Elimination and Some Other Ranking Feature Selection Methods

This chapter presents a newly implemented method, Generational Feature Elimination (GFE), which uses decision tree models and relies on feature occurrences at given levels inside the decision trees created in subsequent generations.

MDFS - MultiDimensional Feature Selection

An R package MDFS (MultiDimensional Feature Selection) is presented that performs identification of informative variables taking into account synergistic interactions between multiple descriptors and the decision variable.

Ensemble feature selection with data-driven thresholding for Alzheimer's disease biomarker discovery

Several data-driven thresholds for automatically identifying the relevant features in an ensemble feature selector are developed, evaluated, and applied to data from two real-world Alzheimer's disease studies.



A Statistical Method for Determining Importance of Variables in an Information System

The method was applied to 12 data sets of biological origin and was shown to be more reliable than the standard application of a random forest for assessing attributes' importance.

Gene selection and classification of microarray data using random forest

It is shown that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.

Bias in random forest variable importance measures: Illustrations, sources and a solution

An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can be used reliably for variable selection even when the potential predictor variables vary in their scale of measurement or their number of categories.

Random Forests

Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.

Classification and Regression by randomForest

Random forests are proposed, which add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.

Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling

It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, whether no gene selection is performed or several popular gene selection methods are used.

Conditional variable importance for random forests

A new, conditional permutation scheme is developed for the computation of the variable importance measure that reflects the true impact of each predictor variable more reliably than the original marginal approach.

Identifying SNPs predictive of phenotype using random forests

This work extends the concept of importance to pairs of predictors, to capture joint effects, and explores the behavior of importance measures over a range of two‐locus disease models in the presence of a varying number of SNPs unassociated with the phenotype.

Neural Networks for Pattern Recognition