Random forest missing data algorithms
@article{Tang2017RandomFM, title={Random forest missing data algorithms}, author={Fei Tang and Hemant Ishwaran}, journal={Statistical Analysis and Data Mining: The ASA Data Science Journal}, year={2017}, volume={10}, pages={363--377} }
Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, imputation performance of various RF…
315 Citations
missForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data
- Computer Science
- Genes & Genomics
- 2022
This study further develops the missForest algorithm by adding a binary particle swarm optimization (BPSO)-based feature-selection step before imputing missing values with missForest, and shows better imputation accuracy than missForest alone for continuous variables.
The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model
- Computer Science
- Frontiers in Public Health
- 2021
Simulation results show that the non-parametric “missForest” based on the unsupervised imputation is the only robust method without inflated type-I errors under all missing mechanisms, and other methods are not valid to test when the missing pattern is informative.
Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction
- Computer Science
- BMC Medical Research Methodology
- 2020
RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR.
A survey on missing data in machine learning
- Computer Science
- J. Big Data
- 2021
This paper aggregates some of the literature on missing data, focusing particularly on machine learning techniques, and proposes and evaluates two methods: the k nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm.
An Efficient and Effective Model to Handle Missing Data in Classification
- Computer Science
- BioMed research international
- 2020
It can be said that BART.m is a high-accuracy model for classification of incomplete datasets that avoids any assumptions and preprocessing steps and can be used even for datasets with 90% missing data.
A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees
- Computer Science
- 2021
This study evaluated the performance of the missing data approaches when data were missing at random or missing completely at random and found the proposed multiple imputation approach and the surrogate split approach had superior performance.
Variable selection with missing data in both covariates and outcomes: Imputation and machine learning
- Computer Science
- Statistical methods in medical research
- 2021
Numeric results suggest that extreme gradient boosting and Bayesian additive regression trees have the overall best variable selection performance with respect to the F1 score and Type I error, while the lasso and backward stepwise selection have subpar performance across various settings.
Strategy to Managing Mixed Datasets with Missing Items
- Computer Science
- IPMU
- 2018
Three missing data techniques (complete ignoring, case deletion, and random forest missing data imputation) were applied to medical data of various types, under a missing completely at random assumption, to solve a classification task and soften the negative impact of input information uncertainty.
Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data
- Computer Science
- Epidemiology
- 2023
It is suggested that denoising autoencoders may overfit the data, leading to poor confounder control, and that use of more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.
Imputation and low-rank estimation with Missing Not At Random data
- Computer Science
- Statistics and Computing
- 2020
This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data in order to predict whether doctors should administer tranexamic acid, which would limit excessive bleeding, to patients with traumatic brain injury.
References
Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study
- Computer Science
- American journal of epidemiology
- 2014
This study compared parametric MICE with a random forest-based MICE algorithm, concluding that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
Comparison of imputation methods for missing laboratory data in medicine
- Medicine
- BMJ Open
- 2013
MissForest is a highly accurate imputation method for missing laboratory data and outperforms other common imputation techniques in terms of imputation error and maintenance of predictive ability with imputed values in two clinical predictive models.
Missing value imputation in high-dimensional phenomic data: imputable or not, and how?
- Computer Science
- BMC Bioinformatics
- 2014
Existing imputation methods for phenomic data are investigated, a novel concept of "imputability measure" (IM) is introduced to identify missing values that are fundamentally inadequate to impute and a self-training selection (STS) scheme to select the best imputation method is proposed.
Good methods for coping with missing data in decision trees
- Computer Science
- Pattern Recognit. Lett.
- 2008
Multiple imputation of discrete and continuous data by fully conditional specification
- Computer Science
- Statistical methods in medical research
- 2007
FCS is a useful, easily applied, and flexible alternative to JM when no convenient and realistic joint distribution can be specified, and it behaves very well in the cases studied.
Multiple Imputation After 18+ Years
- Computer Science
- 1996
A description of the assumed context and objectives of multiple imputation is provided, and the multiple imputation framework and its standard results are reviewed.
Recursive partitioning for missing data imputation in the presence of interaction effects
- Computer Science
- Comput. Stat. Data Anal.
- 2014
Multivariate random forests
- Computer Science
- Wiley Interdiscip. Rev. Data Min. Knowl. Discov.
- 2011
The genesis of, and motivation for, the random forest paradigm as an outgrowth from earlier tree‐structured techniques is outlined, and an illustrative example from ecology is provided that showcases the improved fit and enhanced interpretation afforded by the random forest framework.
Missing value estimation methods for DNA microarrays
- Computer Science
- Bioinform.
- 2001
It is shown that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros).
Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model
- Computer Science, Economics
- Statistical methods in medical research
- 2015
Simulation results suggest the imputation by fully conditional specification proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible.