Random forest missing data algorithms

  • Fei Tang, Hemant Ishwaran
  • Published 19 January 2017
  • Computer Science
  • Statistical Analysis and Data Mining: The ASA Data Science Journal, pages 363-377
Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, imputation performance of various RF… 

missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data

This study aimed to further develop the missForest algorithm by adding a binary particle swarm optimization (BPSO)-based feature-selection step before imputing missing values with missForest, with the goal of achieving better imputation accuracy than missForest alone for continuous variables.

The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model

Simulation results show that the non-parametric “missForest”, based on unsupervised imputation, is the only robust method without inflated type-I errors under all missingness mechanisms, and that the other methods are not valid for testing when the missingness pattern is informative.

Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR.

A survey on missing data in machine learning

This paper aggregates some of the literature on missing data, focusing particularly on machine learning techniques, and proposes and evaluates two methods: k-nearest neighbor imputation and an iterative imputation method (missForest) based on the random forest algorithm.

An Efficient and Effective Model to Handle Missing Data in Classification

BART.m is a high-accuracy model for classifying incomplete datasets that avoids any assumptions and preprocessing steps and can be used even for datasets with 90% missing data.

A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees

This study evaluated the performance of the missing data approaches when data were missing at random or missing completely at random and found the proposed multiple imputation approach and the surrogate split approach had superior performance.

Variable selection with missing data in both covariates and outcomes: Imputation and machine learning

Numeric results suggest that extreme gradient boosting and Bayesian additive regression trees have the best overall variable selection performance with respect to the F1 score and Type I error, while the lasso and backward stepwise selection have subpar performance across various settings.

Strategy to Managing Mixed Datasets with Missing Items

Three missing data techniques (complete ignoring, case deletion, and random forest missing data imputation) were applied to medical data of various types, under a missing completely at random assumption, to solve a classification task and soften the negative impact of uncertainty in the input information.

Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data

It is suggested that denoising autoencoders may overfit the data, leading to poor confounder control, and that using more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.

Imputation and low-rank estimation with Missing Not At Random data

This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data, with an application to predicting whether doctors should administer tranexamic acid, which would limit excessive bleeding, to patients with traumatic brain injury.



Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

The study compared parametric MICE with a random forest-based MICE algorithm and concluded that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.

Comparison of imputation methods for missing laboratory data in medicine

MissForest is a highly accurate method of imputation for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.

Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

Existing imputation methods for phenomic data are investigated, a novel concept of "imputability measure" (IM) is introduced to identify missing values that are fundamentally inadequate to impute, and a self-training selection (STS) scheme is proposed to select the best imputation method.

Good methods for coping with missing data in decision trees

Multiple imputation of discrete and continuous data by fully conditional specification

  • S. van Buuren
  • Computer Science
    Statistical methods in medical research
  • 2007
FCS is a useful and easily applied flexible alternative to JM when no convenient and realistic joint distribution can be specified, and shows that FCS behaves very well in the cases studied.
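The variable-by-variable idea behind fully conditional specification (FCS) can be illustrated with a minimal sketch. It is not the procedure from the paper above: it assumes only two numeric variables, uses a simple least-squares line as each conditional model, and fills in conditional means deterministically, whereas proper FCS draws imputations from each conditional distribution and produces multiple completed data sets. The names `fit_line` and `fcs_impute` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fcs_impute(rows, n_iter=20):
    """Simplified two-variable FCS-style imputation (conditional means only).

    Assumption: each row has exactly two numeric entries, with None marking
    a missing value, and no row is missing both entries.
    """
    missing = [[v is None for v in row] for row in rows]
    filled = [list(row) for row in rows]

    # Step 1: initialise every missing cell with its column mean.
    for j in range(2):
        observed = [row[j] for row in rows if row[j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(len(rows)):
            if missing[i][j]:
                filled[i][j] = col_mean

    # Step 2: cycle over the variables, re-imputing each one from a model
    # fitted on the rows where that variable was actually observed.
    for _ in range(n_iter):
        for target in (0, 1):
            predictor = 1 - target
            train = [(r[predictor], r[target])
                     for r, m in zip(filled, missing) if not m[target]]
            intercept, slope = fit_line([p for p, _ in train],
                                        [t for _, t in train])
            for r, m in zip(filled, missing):
                if m[target]:
                    r[target] = intercept + slope * r[predictor]
    return filled


# Toy data generated from y = 2x + 1, with one x and one y removed.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0),
        (4.0, None), (None, 11.0), (6.0, 13.0)]
completed = fcs_impute(data)
```

Because the toy data are exactly linear, the cycling settles on values consistent with the underlying line. With real data, each conditional model would add random draws so that the imputations reflect uncertainty.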

Multiple Imputation After 18+ Years

A description of the assumed context and objectives of multiple imputation is provided, and the multiple imputation framework and its standard results are reviewed.

Recursive partitioning for missing data imputation in the presence of interaction effects

Multivariate random forests

The genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree‐structured techniques is outlined, and an illustrative example from ecology is provided that showcases the improved fit and enhanced interpretation afforded by the random forest framework.

Missing value estimation methods for DNA microarrays

It is shown that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and that both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros).
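The contrast between nearest-neighbor imputation and the row average baseline can be sketched as follows. This is an illustrative simplification rather than the published KNNimpute procedure: it assumes a small list-of-lists matrix with `None` for missing entries, measures distance only over columns observed in both rows, and takes an unweighted mean of the neighbors. The function names `knn_impute` and `row_mean_impute` are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                completed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return completed


def row_mean_impute(rows):
    """Baseline: fill None with the mean of the observed values in the same row."""
    completed = []
    for row in rows:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else None
        completed.append([mean if v is None else v for v in row])
    return completed


# The last row resembles the first two rows, not the third.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [9.0, 8.0, 7.0],
    [1.0, 2.0, None],
]
knn_filled = knn_impute(data, k=2)
row_filled = row_mean_impute(data)
```

In this toy matrix the neighbor-based fill borrows from the two rows with a similar profile, while the row average only looks at the incomplete row itself and so ignores the pattern shared across rows.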

Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model

Simulation results suggest the imputation by fully conditional specification proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible.