Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches

Authors: Elizabeth A. Handorf, Yinuo Yin, Michael J. Slifker, Shannon M. Lynch
Journal: BMC Medical Research Methodology
Background: Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome…


A Neighborhood-Wide Association Study (NWAS): Example of prostate cancer aggressiveness
Although NWAS requires further testing, it is hypothesis-generating and addresses gaps in geospatial analysis related to empiric assessment and could have broad implications for many diseases and future precision medicine studies focused on multilevel risk factors of disease.
Individual- and Neighborhood-Level Predictors of Mortality in Florida Colorectal Cancer Patients
Factors associated with increased risk of mortality among individuals with CRC included being older, uninsured, or unmarried; having more comorbidities; living in lower-SES neighborhoods; and being diagnosed at a later disease stage.
Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival
This work proposes a subsampling approach for estimating the variance of VIMP and constructing confidence intervals, and finds that the delete-d jackknife variance estimator, a close cousin, is especially effective under low subsampling rates due to its bias correction properties.
Prostate Cancer Severity Associations with Neighborhood Deprivation
Using a neighborhood deprivation index, this paper observed associations between high-grade prostate cancer and neighborhood deprivation in Caucasians and African-Americans.
Model selection and estimation in regression with grouped variables
Summary.  We consider the problem of selecting grouped variables (factors) for accurate prediction in regression. Such a problem arises naturally in many practical situations with the multifactor
An Environment-Wide Association Study (EWAS) on Type 2 Diabetes Mellitus
Despite difficulty in ascertaining causality, the potential for novel factors of large effect associated with T2D justifies the use of EWAS to create hypotheses regarding the broad contribution of the environment to disease.
Evaluation of a two-step iterative resampling procedure for internal validation of genome-wide association studies
The TSIR approach, with a split of at least 70:30 and a cutoff of discovering and replicating SNPs at least 20 times in 100 replications, provides conservative type I error control and has near ‘optimal’ power for internally validated SNPs.
Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-Mapping Studies
A probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation and introduces a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays.
Reclassification of genetic-based risk predictions as GWAS data accumulate
The large amount of reclassification demonstrated in individuals initially classified as Higher Risk but later as Average Risk or Lower Risk suggests that caution is currently warranted in basing clinical decisions on common genetic variation for many complex diseases.
Statistical Learning with Sparsity: The Lasso and Generalizations
Statistical Learning with Sparsity: The Lasso and Generalizations presents methods that exploit sparsity to help recover the underlying signal in a set of data and extract useful and reproducible patterns from big datasets.