• Corpus ID: 221340686

Feature Selection from High-Dimensional Data with Very Low Sample Size: A Cautionary Tale

Ludmila I. Kuncheva, Clare E. Matthews, Álvar Arnaiz-González, Juan José Rodríguez Diez

In classification problems, the purpose of feature selection is to identify a small, highly discriminative subset of the original feature set. In many applications, the dataset may have thousands of features and only a few dozen samples (such data are sometimes termed 'wide'). This study is a cautionary tale demonstrating why feature selection in such cases may lead to undesirable results. To highlight the sample size issue, we derive the required sample size for declaring two features different…

Unsupervised Adaptation for High-Dimensional with Limited-Sample Data Classification Using Variational Autoencoder

Experimental results demonstrate that a variational autoencoder can achieve higher accuracy than traditional dimensionality reduction techniques in the analysis of high-dimensional data with limited sample size.


A novel deep learning approach to feature selection that addresses both challenges simultaneously and discovers relevant features that provide superior prediction performance compared to the state-of-the-art benchmarks in practical scenarios where there is often limited labeled data and high correlations among features.

Benchmarking Feature Selection Methods in Radiomics

Analysis of variance, least absolute shrinkage and selection operator, and minimum redundancy, maximum relevance ensemble appear to be good choices for radiomic studies in terms of predictive performance, as they outperformed most other feature selection methods.

Feature Selection and Molecular Classification of Cancer Phenotypes: A Comparative Study

A comparative study focused on different combinations of feature selectors and classification learning algorithms to identify those with the best predictive capacity and proved that, for a given classification learning algorithm and dataset, all filters have a similar performance.

Measuring the bias of incorrect application of feature selection when using cross-validation in radiomics

Background: Many studies in radiomics are using feature selection methods to identify the most predictive features. At the same time, they employ cross-validation to estimate the performance of the

Combining Genetic Algorithms and SVM for Breast Cancer Diagnosis Using Infrared Thermography

This work proposes an ensemble method for selecting models and features by combining a Genetic Algorithm and the Support Vector Machine (SVM) classifier to diagnose breast cancer.

A Comparative Study on the Potential of Unsupervised Deep Learning-based Feature Selection in Radiomics

It was found that deep learning-based feature selection leads to improved classification results compared to conventional methods, especially for small feature subsets.



A review of feature selection methods on synthetic data

Several synthetic datasets are employed for this purpose, aiming at reviewing the performance of feature selection methods in the presence of an increasing number of irrelevant features, noise in the data, redundancy and interaction between attributes, as well as a small ratio between number of samples and number of features.

Determining appropriate approaches for using data in feature selection

The results indicate that the PART approach is more effective in reducing the bias when the size of a dataset is small but starts to lose its advantage as the dataset size increases.

What should be expected from feature selection in small-sample settings

These questions are addressed using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. It is concluded that one cannot expect to find a feature set whose error is close to optimal.

The feature selection bias problem in relation to high-dimensional gene data

Many are called, but few are chosen. Feature selection and error estimation in high dimensional spaces

Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

This paper extends ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case and improves ISIS by allowing feature deletion in the iterative process.

Feature subset selection bias for classification learning

This research endeavors to illustrate and explain why the bias may not have as negative an impact in classification as expected in regression.

Improved multiclass feature selection via list combination