Bayesian nonparametric cross-study validation of prediction methods

Lorenzo Trippa, Levi Waldron, Curtis Huttenhower, Giovanni Parmigiani
arXiv: Applications
We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second… 
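The validation scheme described above (train on one study, validate on every other study) can be sketched as a small matrix computation. This is a minimal illustration only, not the authors' Bayesian nonparametric model: the two toy learners, the squared-error score, the made-up data, and all function names (`fit_mean`, `fit_slope`, `cross_study_matrix`) are assumptions introduced here.

```python
from statistics import mean

def fit_mean(train):
    """Hypothetical toy learner: always predicts the training mean of y."""
    mu = mean(y for _, y in train)
    return lambda x: mu

def fit_slope(train):
    """Hypothetical toy learner: least-squares line through the origin."""
    sxx = sum(x * x for x, _ in train)
    sxy = sum(x * y for x, y in train)
    beta = sxy / sxx if sxx else 0.0
    return lambda x: beta * x

def mse(model, data):
    """Mean squared prediction error of `model` on a list of (x, y) pairs."""
    return mean((model(x) - y) ** 2 for x, y in data)

def cross_study_matrix(fit, studies, score=mse):
    """Train on study i, validate on every study j != i.

    Returns a k-by-k list of lists Z where Z[i][j] is the score of the
    model trained on study i and evaluated on study j. The diagonal is
    left as None because a study is never validated on itself.
    """
    k = len(studies)
    Z = [[None] * k for _ in range(k)]
    for i, train in enumerate(studies):
        model = fit(train)
        for j, test in enumerate(studies):
            if i != j:
                Z[i][j] = score(model, test)
    return Z

# Made-up data: three small "studies" of (x, y) pairs. The third study
# behaves differently from the first two, mimicking study heterogeneity.
studies = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    [(1.0, 2.1), (2.0, 3.9), (4.0, 8.0)],
    [(1.0, 5.0), (2.0, 5.0), (3.0, 5.0)],
]

Z_mean = cross_study_matrix(fit_mean, studies)
Z_slope = cross_study_matrix(fit_slope, studies)
```

Comparing `Z_mean` and `Z_slope` entry by entry gives one algorithm comparison per ordered pair of studies. In this toy data, rows and columns involving the third study show much larger errors for the slope learner, which is the kind of heterogeneity the abstract says has to be assessed before comparisons are trusted.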
Integration of survival data from multiple studies.
A statistical procedure is introduced that integrates datasets from multiple biomedical studies to predict patients' survival from individual clinical and genomic profiles, and the proposed model is shown to increase the accuracy of survival predictions compared to alternative meta-analytic methods.
Covariate-Profile Similarity Weighting and Bagging Studies with the Study Strap: Multi-Study Learning for Human Neurochemical Sensing
This work introduces two generalizations of multi-study ensemble predictions, and introduces a hierarchical resampling scheme to generate pseudo-study replicates ("study straps") and ensemble classifiers trained on these rather than the original studies themselves.
Multi-Study Boosting: Theoretical Considerations for Merging vs. Ensembling
The transition point theory of Guan, Parmigiani and Patil (2019) is extended to boosting with linear learners, and a bias-variance decomposition of estimation error conditional on the selection path is characterized for boosting with component-wise linear learners.
Low-cost scalable discretization, prediction, and feature selection for complex systems
A low-cost, improved-quality scalable probabilistic approximation algorithm is introduced that allows simultaneous data-driven optimal discretization, feature selection, and prediction; its optimality, parallel efficiency, and linear scalability of iteration cost are proved.
MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature
The MetaGxData package compendium is demonstrated to be a flexible framework that facilitates meta-analyses by using it to identify common prognostic genes in ovarian and breast cancer and to create the first gene signature that is prognostic in a meta-analysis across 3 cancer types.
MetaGxData: Breast and Ovarian Clinically Annotated Transcriptomics Datasets
The MetaGxData package compendium is developed, which includes manually curated and standardized clinical, pathological, survival, and treatment metadata across both breast and ovarian cancer microarray data, and its flexible framework and unified nomenclature are presented.
Large-scale predictive modeling and analytics through regression queries in data management systems
This work contributes a novel predictive analytics model and an associated statistical learning methodology that are efficient, scalable, and accurate in discovering piecewise linear dependencies among variables by observing only regression queries and their answers issued to a DMS.
Hurricane Forecasting Using by Parallel Calculations & Machine Learning
This research aims to determine the causal relationship between the flow of particles coming from the Sun and the emergence of hurricanes Irma, Jose, and Katia. Five parameters i.e.
Invariance and variability in interaction error-related potentials and their consequences for classification
It is found that interaction ErrPs are empirically invariant over time (for the same subject and same interface) and, to a lesser extent, across subjects (for the same interface).
Market Forecasts and Client Behavioral Data: Towards Finding Adequate Model Complexity
A modeling cycle is proposed which provides information about the adequacy of a model complexity class and which also highlights some nonstandard measures of expected model performance.


Cross-study validation for the assessment of prediction algorithms
This work develops and implements a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets, and suggests that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation.
Optimized application of penalized regression methods to diverse genomic data
This work provides an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data and demonstrates the real-life application to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data.
On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data.
A simple C-statistic is presented that consistently estimates a conventional concordance measure free of censoring, and results from numerical studies suggest that the new procedure performs well in finite samples.
A Bayesian Semiparametric Model for Random-Effects Meta-Analysis
In meta-analysis, there is an increasing trend to explicitly acknowledge the presence of study variability through random-effects models. That is, one assumes that for each study there is a
Comparative meta-analysis of prognostic gene signatures for late-stage ovarian cancer.
This work addresses outstanding controversies in the ovarian cancer literature and provides a reproducible framework for meta-analytic evaluation of gene signatures and confirms that these require improvement to be of clinical value.
Gene expression–based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study
A large, training–testing, multi-site, blinded validation study to characterize the performance of several prognostic models based on gene expression for 442 lung adenocarcinomas, providing the largest available set of microarray data with extensive pathological and clinical annotation for lung adenocarcinomas.
Statistical Comparisons of Classifiers over Multiple Data Sets
  • J. Demšar
  • Computer Science
    J. Mach. Learn. Res.
  • 2006
A set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers is recommended: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparisons of more classifiers over multiple data sets.
Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples.
The survival signature provides the most accurate and validated prognostic model for early- and advanced-stage high-grade, serous ovarian cancer and the debulking signature accurately predicts the outcome of cytoreductive surgery, potentially allowing for stratification of patients for primary vs secondary cytoreduction.
curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome
A manually curated data collection for gene expression meta-analysis of patients with ovarian cancer and software for reproducible preparation of similar databases are introduced.
Impact of Bioinformatic Procedures in the Development and Translation of High-Throughput Molecular Classifiers in Oncology
This article uses publicly available expression data from patients with non–small cell lung cancer to first illustrate the challenges of experimental design and preprocessing of data before clinical application and highlights the challenges of high-dimensional statistical analysis.