State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues

  title={State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues},
  author={Willi Sauerbrei and Aris Perperoglou and Matthias Schmid and Michał Abrahamowicz and Heiko Becher and Harald Binder and Daniela Dunkler and Frank E. Harrell and Patrick Royston and Georg Heinze and G{\'e}raldine Rauch},
  journal={Diagnostic and Prognostic Research},
Background: How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc ‘traditional’ approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful…
Variable Selection and Redundancy in Multivariate Regression Models
Three other main objectives are presented: i) to eliminate variables that are not relevant; ii) to return a small subset of variables that has the same or better prediction performance as a model with all original variables; and iii) to investigate the consistency of these small subsets.
Using Background Knowledge from Preceding Studies for Building a Random Forest Prediction Model: A Plasmode Simulation Study
This paper examines the usefulness of external information from prior variable selection studies that used traditional statistical modeling approaches such as the Lasso, or suboptimal methods such as univariate selection, and recommends appraising the methodological quality of studies that serve as an external information source for future prediction model development.
The roles of predictors in cardiovascular risk models - a question of modeling culture?
Predictor-risk relations from ML models may differ from those obtained by statistical models, even with large sample sizes, and predictors may assume different roles in risk prediction models.
Review of guidance papers on regression modeling in statistical series of medical journals
Assessment of the current level of knowledge with regard to regression modeling contained in statistical papers found few misconceptions or misleading recommendations, but relevant gaps were identified with respect to addressing nonlinear effects of continuous predictors, model specification and variable selection.
Comparison of model-building strategies for excess hazard regression models in the context of cancer epidemiology
The results from extensive simulations evaluating varying model complexity and sample sizes provide guidelines on a model selection strategy in the context of excess hazard modelling.
Systematic review of education and practical guidance on regression modeling for medical researchers who lack a strong statistical background: Study protocol
This review will provide a basis for future guidance papers and tutorials in the field of regression modeling, which will enable medical researchers to interpret publications correctly, to perform basic statistical analyses correctly, and to identify situations when the help of a statistical expert is required.
Estimating and characterizing the burden of multimorbidity in the community: A comprehensive multistep analysis of two large nationwide representative surveys in France
The burden of multimorbidity in the adult population in France is estimated in terms of number and type of conditions, type of underlying mechanisms, and analysis of the joint effects for identifying combinations with the most deleterious interaction effects on health status.
A comparison of full model specification and backward elimination of potential confounders when estimating marginal and conditional causal effects on binary outcomes from observational data.
A common view in epidemiology is that automated confounder selection methods, such as backward elimination, should be avoided as they can lead to biased effect estimates and underestimation of their
Deselection of base-learners for statistical boosting—with an application to distributional regression
A new procedure enhances variable selection in component-wise gradient boosting by giving the algorithm the chance to deselect base-learners of minor importance, which addresses the issue of too many variables in some situations.
A prediction model for the decline in renal function in people with type 2 diabetes mellitus: study protocol
The proposed state-of-the-art methodology will be developed using multiple multicentre study cohorts of people with DM2 in various CKD stages at baseline, who have received modern therapeutic treatment strategies of diabetic kidney disease in contrast to previous models.


Selection of important variables and determination of functional form for continuous predictors in multivariable model building
It is argued why MFP is the preferred approach for multivariable model building with continuous covariates, and it is shown that spline modelling, while extremely flexible, can generate fitted curves with uninterpretable 'wiggles'.
On stability issues in deriving multivariable regression models
Bootstrap resampling will be used to assess variable selection stability, to derive a predictor that incorporates model uncertainty, check for influential points, and visualize the variable selection process.
Purposeful selection of variables in logistic regression
An algorithm which automates the purposeful selection of covariates within which an analyst makes a variable selection decision at each step of the modeling process and has the capability of retaining important confounding variables, resulting potentially in a slightly richer model.
Categorical variables with many categories are preferentially selected in bootstrap‐based model selection procedures for multivariable regression models
If automated variable selection is conducted on bootstrap samples, variables with more categories are substantially favored over variables with fewer categories and over metric variables even if none of them have any effect.
Variable selection – A review and recommendations for the practicing statistician
Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for
Five myths about variable selection
G. Heinze, D. Dunkler. Transplant International: official journal of the European Society for Organ Transplantation, 2017.
It is emphasized that variable selection and all problems related with it can often be avoided by the use of expert knowledge, and how five common misconceptions often lead to inappropriate application of variable selection is discussed.
A bootstrap resampling procedure for model building: application to the Cox regression model.
A bootstrap-model selection procedure is developed, combining the bootstrap method with existing selection techniques such as stepwise methods, for the selection of variables in the framework of a regression model which might influence the outcome variable.
The Use of Resampling Methods to Simplify Regression Models in Medical Statistics
The problems of replication stability, model complexity, selection bias and an overoptimistic estimate of the predictive value of a model are discussed together with several proposals based on resampling methods, which favour greater simplicity of the final regression model.
Applied Logistic Regression
Applied Logistic Regression, Third Edition provides an easily accessible introduction to the logistic regression model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.