Bayesian nonparametric cross-study validation of prediction methods
@article{Trippa2015BayesianNC, title={Bayesian nonparametric cross-study validation of prediction methods}, author={Lorenzo Trippa and Levi Waldron and Curtis Huttenhower and Giovanni Parmigiani}, journal={arXiv: Applications}, year={2015} }
We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second…
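The leave-one-in scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic studies, the ridge learner, the mean squared error metric, and the names `fit_ridge` and `cross_study_matrix` are all assumptions made for the example. Entry `Z[i, j]` holds the performance of the model trained on study `i` and validated on study `j`.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Return ridge coefficients; lam=0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cross_study_matrix(studies, lam):
    """Z[i, j] = MSE of the model trained on study i, validated on study j."""
    k = len(studies)
    Z = np.full((k, k), np.nan)  # diagonal stays NaN: no within-study validation
    for i, (X_train, y_train) in enumerate(studies):
        beta = fit_ridge(X_train, y_train, lam)  # train on exactly one study
        for j, (X_val, y_val) in enumerate(studies):
            if i != j:  # validate on each remaining study
                Z[i, j] = np.mean((y_val - X_val @ beta) ** 2)
    return Z

# Hypothetical synthetic studies sharing one linear signal (assumption).
rng = np.random.default_rng(0)
true_beta = np.array([1.0, -2.0, 0.5])
studies = []
for n in (40, 60, 50, 80):
    X = rng.normal(size=(n, 3))
    y = X @ true_beta + rng.normal(scale=0.5, size=n)
    studies.append((X, y))

# Two candidate algorithms compared on the same cross-study design.
Z_ols = cross_study_matrix(studies, lam=0.0)
Z_ridge = cross_study_matrix(studies, lam=10.0)
print(np.round(Z_ols, 3))
print("mean off-diagonal MSE, OLS:  ", np.nanmean(Z_ols))
print("mean off-diagonal MSE, ridge:", np.nanmean(Z_ridge))
```

Comparing algorithms then amounts to comparing their off-diagonal matrices. The heterogeneity assessment and the Bayesian nonparametric model mentioned in the abstract are not reproduced here.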
27 Citations
Integration of survival data from multiple studies.
- Biology; Biometrics
- 2021
A statistical procedure that integrates datasets from multiple biomedical studies to predict patients' survival, based on individual clinical and genomic profiles, is introduced, and it is shown that the proposed model increases the accuracy of survival predictions compared to alternative meta-analytic methods.
Covariate-Profile Similarity Weighting and Bagging Studies with the Study Strap: Multi-Study Learning for Human Neurochemical Sensing
- Computer Science
- 2019
This work introduces two generalizations of multi-study ensemble predictions, and introduces a hierarchical resampling scheme to generate pseudo-study replicates ("study straps") and ensemble classifiers trained on these rather than the original studies themselves.
Multi-Study Boosting: Theoretical Considerations for Merging vs. Ensembling
- Computer Science; ArXiv
- 2022
The transition point theory from Guan, Parmigiani and Patil (2019) was extended to boosting with linear learners, and a bias-variance decomposition of estimation error conditional on the selection path was characterized for boosting with component-wise linear learners.
Low-cost scalable discretization, prediction, and feature selection for complex systems
- Computer Science; Science Advances
- 2020
A low-cost, improved-quality, scalable probabilistic approximation algorithm, allowing for simultaneous data-driven optimal discretization, feature selection, and prediction, is introduced, and its optimality, parallel efficiency, and linear scalability of iteration cost are proved.
MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature
- Biology; bioRxiv
- 2018
The MetaGxData package compendium is demonstrated to be a flexible framework that facilitates meta-analyses by using it to identify common prognostic genes in ovarian and breast cancer and to create the first gene signature that is prognostic in a meta-analysis across 3 cancer types.
MetaGxData: Breast and Ovarian Clinically Annotated Transcriptomics Datasets
- Biology
- 2016
The MetaGxData package compendium is developed, which includes manually curated and standardized clinical, pathological, survival, and treatment metadata across both breast and ovarian cancer microarray data, and its flexible framework and unified nomenclature are presented.
Large-scale predictive modeling and analytics through regression queries in data management systems
- Computer Science; International Journal of Data Science and Analytics
- 2018
This work contributes with a novel predictive analytics model and an associated statistical learning methodology which are efficient, scalable and accurate in discovering piecewise linear dependencies among variables by observing only regression queries and their answers issued to a DMS.
Hurricane Forecasting Using by Parallel Calculations & Machine Learning
- Environmental Science, Physics; 2018 IEEE First International Conference on System Analysis & Intelligent Computing (SAIC)
- 2018
This research is devoted to determining the causal relationship between the flow of particles coming from the Sun and the emergence of the hurricanes Irma, Jose, and Katia. Five parameters i.e.…
Invariance and variability in interaction error-related potentials and their consequences for classification
- Psychology; Journal of neural engineering
- 2017
It is found that interaction ErrPs are empirically invariant over time (for the same subject and same interface) and to a lesser extent across subjects (for the same interface).
Market Forecasts and Client Behavioral Data: Towards Finding Adequate Model Complexity
- Economics; Studia Universitatis „Vasile Goldis” Arad – Economics Series
- 2018
A modeling cycle is proposed which provides information about the adequacy of a model complexity class and which also highlights some nonstandard measures of expected model performance.
References
Cross-study validation for the assessment of prediction algorithms
- Computer Science; Bioinform.
- 2014
This work develops and implements a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets, and suggests that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation.
Optimized application of penalized regression methods to diverse genomic data
- Computer Science; Bioinform.
- 2011
This work provides an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data and demonstrates the real-life application to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data.
On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data.
- Mathematics; Statistics in medicine
- 2011
A simple C-statistic is presented that consistently estimates a conventional concordance measure free of censoring, and results from numerical studies suggest that the new procedure performs well in finite samples.
A Bayesian Semiparametric Model for Random-Effects Meta-Analysis
- Mathematics
- 2005
In meta-analysis, there is an increasing trend to explicitly acknowledge the presence of study variability through random-effects models. That is, one assumes that for each study there is a…
Comparative meta-analysis of prognostic gene signatures for late-stage ovarian cancer.
- Medicine; Journal of the National Cancer Institute
- 2014
This work addresses outstanding controversies in the ovarian cancer literature and provides a reproducible framework for meta-analytic evaluation of gene signatures and confirms that these require improvement to be of clinical value.
Gene expression–based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study
- Biology; Nature Medicine
- 2008
A large, training–testing, multi-site, blinded validation study to characterize the performance of several prognostic models based on gene expression for 442 lung adenocarcinomas, providing the largest available set of microarray data with extensive pathological and clinical annotation for lung adenocarcinomas.
Statistical Comparisons of Classifiers over Multiple Data Sets
- Computer Science; J. Mach. Learn. Res.
- 2006
A set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers is recommended: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparisons of more classifiers over multiple data sets.
Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples.
- Medicine, Biology; Journal of the National Cancer Institute
- 2014
The survival signature provides the most accurate and validated prognostic model for early- and advanced-stage high-grade, serous ovarian cancer and the debulking signature accurately predicts the outcome of cytoreductive surgery, potentially allowing for stratification of patients for primary vs secondary cytoreduction.
curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome
- Biology, Computer Science; Database J. Biol. Databases Curation
- 2013
A manually curated data collection for gene expression meta-analysis of patients with ovarian cancer and software for reproducible preparation of similar databases are introduced.
Impact of Bioinformatic Procedures in the Development and Translation of High-Throughput Molecular Classifiers in Oncology
- Medicine; Clinical Cancer Research
- 2013
This article uses publicly available expression data from patients with non–small cell lung cancer to first illustrate the challenges of experimental design and preprocessing of data before clinical application and highlights the challenges of high-dimensional statistical analysis.