Corpus ID: 237572213

Optimal Ensemble Construction for Multi-Study Prediction with Applications to COVID-19 Excess Mortality Estimation

@article{Loewinger2021OptimalEC,
  title={Optimal Ensemble Construction for Multi-Study Prediction with Applications to COVID-19 Excess Mortality Estimation},
  author={Gabriel Loewinger and Rolando Acosta Nunez and Rahul Mazumder and Giovanni Parmigiani},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.09164}
}
It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets and applying standard statistical learning methods can result in poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown "multi-study ensembling" to be a viable alternative that leverages the variability across datasets in a manner that promotes… 
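
A minimal sketch of the contrast the abstract draws, assuming numpy and scikit-learn on synthetic data (not the authors' implementation): merging pools all studies into a single model, while multi-study ensembling fits one learner per study and averages their predictions.

```python
# Minimal sketch: merging vs. multi-study ensembling on synthetic studies.
# Assumes numpy and scikit-learn; all data and parameters are illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_studies, n, p = 5, 100, 10

# Simulate heterogeneous studies: shared coefficients plus study-specific shifts.
beta = rng.normal(size=p)
studies = []
for _ in range(n_studies):
    beta_k = beta + rng.normal(scale=0.5, size=p)  # between-study heterogeneity
    X = rng.normal(size=(n, p))
    studies.append((X, X @ beta_k + rng.normal(size=n)))

X_test = rng.normal(size=(n, p))
y_test = X_test @ beta + rng.normal(size=n)

# Merging: pool all studies and fit a single model.
X_pool = np.vstack([X for X, _ in studies])
y_pool = np.concatenate([y for _, y in studies])
merged = Ridge().fit(X_pool, y_pool)

# Multi-study ensembling: fit one learner per study, average the predictions.
members = [Ridge().fit(X, y) for X, y in studies]
ens_pred = np.mean([m.predict(X_test) for m in members], axis=0)

print("merged MSE:  ", np.mean((merged.predict(X_test) - y_test) ** 2))
print("ensemble MSE:", np.mean((ens_pred - y_test) ** 2))
```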

Merging versus Ensembling in Multi-Study Machine Learning: Theoretical Insight from Random Effects

This work shows analytically, and confirms via simulation, that merging yields lower prediction error than ensembling when the predictor-outcome relationships are relatively homogeneous across studies. It provides analytic expressions for the transition point in various scenarios, studies asymptotic properties, and illustrates how transition-point theory can be used to decide when studies should be combined, with an application from metabolomics.
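
The transition-point idea can be illustrated with a hedged simulation: sweep a between-study heterogeneity scale and compare the two errors; where ensembling overtakes merging marks the empirical transition point. The scale `sigma` and the unequal study sizes below are illustrative knobs, not the paper's parameterization (unequal sizes make merging's implicit sample-size weighting differ from the ensemble's equal weighting).

```python
# Sketch: error of merging vs. equal-weight ensembling as between-study
# heterogeneity grows. Unequal study sizes make merging's implicit
# sample-size weighting differ from the ensemble's equal weighting,
# so an empirical transition point appears as sigma increases.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
sizes, p = [300, 50, 50, 50, 50], 5
beta = rng.normal(size=p)

for sigma in [0.0, 0.25, 0.5, 1.0, 2.0]:     # heterogeneity scale (illustrative)
    err_merge, err_ens = [], []
    for _ in range(100):                      # Monte Carlo replicates
        data = []
        for n_k in sizes:
            b = beta + rng.normal(scale=sigma, size=p)   # random study effects
            X = rng.normal(size=(n_k, p))
            data.append((X, X @ b + rng.normal(size=n_k)))
        X_te = rng.normal(size=(200, p))
        y_te = X_te @ (beta + rng.normal(scale=sigma, size=p)) + rng.normal(size=200)

        X_pool = np.vstack([X for X, _ in data])
        y_pool = np.concatenate([y for _, y in data])
        merged = LinearRegression().fit(X_pool, y_pool)
        preds = [LinearRegression().fit(X, y).predict(X_te) for X, y in data]

        err_merge.append(np.mean((merged.predict(X_te) - y_te) ** 2))
        err_ens.append(np.mean((np.mean(preds, axis=0) - y_te) ** 2))
    print(f"sigma={sigma:.2f}  merged={np.mean(err_merge):.2f}  ensemble={np.mean(err_ens):.2f}")
```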

Cross-study learning for generalist and specialist predictions

It is proved that, under certain regularity conditions, the proposed framework produces a stacked prediction function with the oracle property; the framework is applied to predicting mortality from a collection of variables that includes long-term exposure to common air pollutants.
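
Cross-study stacking can be sketched by weighting study-specific models according to how well they predict on the other studies; a minimal sketch, assuming scikit-learn and scipy, with non-negative least squares for the weights (the learners and the zero-masking device below are illustrative, not the paper's exact estimator).

```python
# Sketch: cross-study stacking. Fit one model per study, build a stacking
# matrix of each model's predictions on the other studies, and solve for
# non-negative weights. Zero-masking a model's own study is one common
# device to avoid rewarding within-study fit.
import numpy as np
from scipy.optimize import nnls
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n_studies, n, p = 4, 80, 6
beta = rng.normal(size=p)
studies = []
for _ in range(n_studies):
    b = beta + rng.normal(scale=0.3, size=p)
    X = rng.normal(size=(n, p))
    studies.append((X, X @ b + rng.normal(size=n)))

models = [Ridge().fit(X, y) for X, y in studies]

# Stacking matrix: model j's predictions on study k's data (own study masked).
P = np.zeros((n_studies * n, n_studies))
y_all = np.concatenate([y for _, y in studies])
for j, m in enumerate(models):
    for k, (X, _) in enumerate(studies):
        if j != k:
            P[k * n:(k + 1) * n, j] = m.predict(X)

w, _ = nnls(P, y_all)          # non-negative stacking weights
w = w / w.sum()                # normalize to a convex combination

X_new = rng.normal(size=(10, p))
y_hat = np.column_stack([m.predict(X_new) for m in models]) @ w
print("stacking weights:", np.round(w, 3))
```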

Hierarchical Resampling for Bagging in Multi-Study Prediction with Applications to Human Neurochemical Sensing

We propose the “study strap ensemble,” which combines advantages of two common approaches to fitting prediction models when multiple training datasets (“studies”) are available: pooling studies and fitting a single model, and multi-study ensembling, which fits a separate model on each study.
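
A minimal sketch of the hierarchical resampling behind the study strap, assuming numpy and scikit-learn: each replicate resamples whole studies with replacement, then observations within each sampled study, pools the result into a pseudo-study, and fits one learner; predictions are averaged across replicates. Details such as the bag size follow the paper only loosely.

```python
# Sketch: study strap ensemble via hierarchical resampling. Each replicate
# resamples studies with replacement, then observations within each sampled
# study, pools them into a pseudo-study, and fits one learner.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_studies, n, p, n_straps = 5, 60, 8, 25
beta = rng.normal(size=p)
studies = []
for _ in range(n_studies):
    b = beta + rng.normal(scale=0.4, size=p)
    X = rng.normal(size=(n, p))
    studies.append((X, X @ b + rng.normal(size=n)))

members = []
for _ in range(n_straps):
    picked = rng.integers(n_studies, size=n_studies)  # studies, with replacement
    Xs, ys = [], []
    for k in picked:
        X, y = studies[k]
        idx = rng.integers(n, size=n)                 # observations, with replacement
        Xs.append(X[idx])
        ys.append(y[idx])
    members.append(Ridge().fit(np.vstack(Xs), np.concatenate(ys)))

# Average member predictions, as in ordinary bagging.
X_new = rng.normal(size=(10, p))
print(np.mean([m.predict(X_new) for m in members], axis=0))
```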

Cross-study validation for the assessment of prediction algorithms

This work develops and implements a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. It suggests that standard cross-validation produces inflated discrimination accuracy for all algorithms considered when compared to cross-study validation.
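
Cross-study validation reduces to a leave-one-study-out loop: train on all studies but one, evaluate on the held-out study, and compare against conventional cross-validation on the pooled data. A minimal sketch assuming scikit-learn and synthetic studies:

```python
# Sketch: cross-study validation (leave one study out) vs. pooled CV.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_studies, n, p = 5, 80, 6
beta = rng.normal(size=p)
studies = []
for _ in range(n_studies):
    b = beta + rng.normal(scale=0.5, size=p)
    X = rng.normal(size=(n, p))
    studies.append((X, X @ b + rng.normal(size=n)))

# Cross-study validation: hold out one entire study at a time.
cs_err = []
for k in range(n_studies):
    X_tr = np.vstack([X for j, (X, _) in enumerate(studies) if j != k])
    y_tr = np.concatenate([y for j, (_, y) in enumerate(studies) if j != k])
    X_te, y_te = studies[k]
    model = Ridge().fit(X_tr, y_tr)
    cs_err.append(np.mean((model.predict(X_te) - y_te) ** 2))

# Conventional CV on the pooled data, which ignores study boundaries.
X_pool = np.vstack([X for X, _ in studies])
y_pool = np.concatenate([y for _, y in studies])
cv_err = -cross_val_score(Ridge(), X_pool, y_pool,
                          scoring="neg_mean_squared_error", cv=5)

print("cross-study MSE:", np.mean(cs_err))  # typically larger, and more honest
print("pooled CV MSE:  ", np.mean(cv_err))
```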

Tracking Cross-Validated Estimates of Prediction Error as Studies Accumulate

This work provides a statistical formulation in the large-sample limit: studies themselves are modeled as components of a mixture, all error rates are optimal (Bayes) for a two-class problem, and the analysis suggests what is likely to be observed with large samples and consistent density estimators.

Modeling Between-Study Heterogeneity for Improved Replicability in Gene Signature Selection and Clinical Prediction

A novel approach is introduced to select, from multiple datasets, gene signatures whose effects are consistently nonzero, accounting for between-study heterogeneity to improve replicability.
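
The selection criterion can be caricatured as keeping only features whose estimated effects are nonzero in every study. The sketch below intersects per-study lasso supports; the actual method models between-study heterogeneity jointly rather than intersecting separate fits.

```python
# Sketch: keep features whose lasso coefficients are nonzero in *all* studies.
# Intersecting per-study supports is a crude stand-in for "consistently
# nonzero" effects; the paper models heterogeneity jointly instead.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n_studies, n, p = 4, 100, 20
beta = np.zeros(p)
beta[:5] = 2.0                                # five truly active features

support = np.ones(p, dtype=bool)
for _ in range(n_studies):
    b = beta + rng.normal(scale=0.2, size=p)  # study-specific effects
    X = rng.normal(size=(n, p))
    y = X @ b + rng.normal(size=n)
    coef = Lasso(alpha=0.1).fit(X, y).coef_
    support &= (coef != 0)                    # intersect supports across studies

print("selected features:", np.flatnonzero(support))
```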

Ensemble Regression Models for Short-term Prediction of Confirmed COVID-19 Cases

A regression-based ensemble learning model comprising linear regression, ridge, lasso, ARIMA, and SVR, which takes the previous 14 days’ data into account to predict the number of new COVID-19 cases in the short term, shows superior prediction performance for a vast majority of countries.
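
A hedged sketch of the lag-feature ensemble: build 14-day lag features from a daily case series, fit several regressors, and average their one-step-ahead predictions. The member models and averaging rule here are illustrative, and ARIMA is omitted for brevity.

```python
# Sketch: ensemble of regressors on 14-day lag features for one-step-ahead
# case prediction. Member models are illustrative; ARIMA is omitted.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(6)
# Synthetic daily case counts with a trend and weekly seasonality.
t = np.arange(200)
cases = 100 + 2 * t + 20 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=10, size=200)

LAGS = 14
X = np.array([cases[d - LAGS:d] for d in range(LAGS, len(cases))])
y = cases[LAGS:]
X_tr, X_te = X[:-1], X[-1:]                   # hold out the final day
y_tr, y_te = y[:-1], y[-1]

models = [LinearRegression(),
          Ridge(),
          make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
          make_pipeline(StandardScaler(), SVR(C=100.0))]
preds = [m.fit(X_tr, y_tr).predict(X_te)[0] for m in models]

print("member predictions:", np.round(preds, 1))
print("ensemble (mean):", round(float(np.mean(preds)), 1), "actual:", round(float(y_te), 1))
```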

Measuring the Effect of Inter-Study Variability on Estimating Prediction Error

By examining how quickly ISV performance approaches RCV performance as the number of studies is increased, one can estimate when “sufficient” diversity has been achieved for learning a molecular signature likely to translate to new clinical settings without significant loss of accuracy.

Robustifying genomic classifiers to batch effects via ensemble learning

This work provides a systematic comparison between the traditional strategy of integrating the data and an ensemble learning strategy that integrates predictions instead; the ensemble strategy yields better discrimination in independent validation.

Multi-Source Causal Inference Using Control Variates

This work proposes a general algorithm to estimate causal effects from multiple data sources, where the average treatment effect (ATE) may be identifiable in some datasets but not others, and shows theoretically that the use of control variates reduces the variance of the ATE estimate.
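
The control-variate mechanism can be shown in stylized form: given an unbiased ATE estimate and a correlated statistic with known mean, subtracting an optimally scaled version of that statistic reduces variance without introducing bias. This is a toy Monte Carlo illustration, not the paper's estimator or identification conditions.

```python
# Toy Monte Carlo: variance reduction of an ATE estimate via a control variate.
# theta: unbiased ATE estimates; z: correlated statistic with known mean 0.
# The optimal coefficient is c* = Cov(theta, z) / Var(z).
import numpy as np

rng = np.random.default_rng(7)
reps, n, ate_true = 2000, 200, 1.0

theta = np.empty(reps)
z = np.empty(reps)
for r in range(reps):
    u = rng.normal(size=n)                        # noise shared by both estimators
    theta[r] = ate_true + u.mean() + 0.02 * rng.normal()  # unbiased ATE estimate
    z[r] = u.mean() + 0.02 * rng.normal()         # control variate, E[z] = 0

cov = np.cov(theta, z)
c = cov[0, 1] / cov[1, 1]                         # estimated optimal coefficient
theta_cv = theta - c * z                          # adjusted estimator, still unbiased

print("variance without control variate:", theta.var())
print("variance with control variate:   ", theta_cv.var())
```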