Bayesian nonparametric cross-study validation of prediction methods

@article{Trippa2015BayesianNC,
  title={Bayesian nonparametric cross-study validation of prediction methods},
  author={Lorenzo Trippa and Levi Waldron and Curtis Huttenhower and Giovanni Parmigiani},
  journal={arXiv: Applications},
  year={2015}
}
We consider comparisons of statistical learning algorithms using multiple data sets, via leave-one-in cross-study validation: each of the algorithms is trained on one data set; the resulting model is then validated on each remaining data set. This poses two statistical challenges that need to be addressed simultaneously. The first is the assessment of study heterogeneity, with the aim of identifying a subset of studies within which algorithm comparisons can be reliably carried out. The second…
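The leave-one-in scheme described in the abstract (train on one data set, validate on each of the others) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `fit_least_squares`, `mean_squared_error`, and `cross_study_matrix`, the least-squares learner, the simulated studies, and the choice to leave the diagonal as NaN are all assumptions made for the example.

```python
import numpy as np

def fit_least_squares(X, y):
    # Hypothetical learner: ordinary least squares with an intercept.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mean_squared_error(coef, X, y):
    # Hypothetical validation score: mean squared prediction error.
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((A @ coef - y) ** 2))

def cross_study_matrix(studies, fit, score):
    """Entry (i, j) is the score of a model trained on study i
    and validated on study j; the diagonal is left as NaN here."""
    k = len(studies)
    Z = np.full((k, k), np.nan)
    for i, (X_train, y_train) in enumerate(studies):
        model = fit(X_train, y_train)  # train on one study only
        for j, (X_val, y_val) in enumerate(studies):
            if i != j:
                Z[i, j] = score(model, X_val, y_val)  # validate on each other study
    return Z

# Simulated, homogeneous studies (illustrative data only).
rng = np.random.default_rng(0)
true_coef = np.array([1.0, -2.0, 0.5])
studies = []
for _ in range(4):
    X = rng.normal(size=(50, 3))
    y = X @ true_coef + rng.normal(scale=0.1, size=50)
    studies.append((X, y))

Z = cross_study_matrix(studies, fit_least_squares, mean_squared_error)
print(np.round(Z, 4))
```

Running one such matrix per algorithm gives the raw material for comparing algorithms across studies; rows or columns with unusually poor scores are one informal signal of the study heterogeneity the abstract mentions.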
Integration of survival data from multiple studies.
TLDR
A statistical procedure is introduced that integrates datasets from multiple biomedical studies to predict patients' survival based on individual clinical and genomic profiles; the proposed model is shown to increase the accuracy of survival predictions compared to alternative meta-analytic methods.
Covariate-Profile Similarity Weighting and Bagging Studies with the Study Strap: Multi-Study Learning for Human Neurochemical Sensing
TLDR
This work introduces two generalizations of multi-study ensemble predictions, along with a hierarchical resampling scheme that generates pseudo-study replicates ("study straps") and ensembles classifiers trained on these rather than on the original studies themselves.
Low-cost scalable discretization, prediction, and feature selection for complex systems
TLDR
A low-cost, improved-quality scalable probabilistic approximation algorithm, allowing for simultaneous data-driven optimal discretization, feature selection, and prediction, is introduced, and its optimality, parallel efficiency, and linear scalability of iteration cost are proved.
Prediction of Hereditary Cancers Using Neural Networks
Family history is a major risk factor for many types of cancer. Mendelian risk prediction models translate family histories into cancer risk predictions based on knowledge of cancer susceptibility…
Prediction of Hereditary Cancers Using Neural Networks
TLDR
A framework to apply neural networks to family history data and investigate their ability to learn inherited susceptibility to cancer is developed and it is demonstrated that the proposed neural network models are able to achieve nearly optimal prediction performance.
MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature
TLDR
It is demonstrated that MetaGxData is a flexible framework that facilitates meta-analyses by using it to identify common prognostic genes in ovarian and breast cancer and to create the first gene signature that is prognostic in a meta-analysis across 3 cancers.
MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature
TLDR
The MetaGxData package compendium is demonstrated to be a flexible framework that facilitates meta-analyses by using it to identify common prognostic genes in ovarian and breast cancer and to create the first gene signature that is prognostic in a meta-analysis across 3 cancer types.
Low-cost scalable discretization, prediction and feature selection for complex systems
TLDR
This work introduces a low-cost improved-quality Scalable Probabilistic Approximation algorithm, allowing for simultaneous data-driven optimal discretization, feature selection and prediction in a range of large realistic data classification and prediction problems.
MetaGxData: Breast and Ovarian Clinically Annotated Transcriptomics Datasets
TLDR
The MetaGxData package compendium is developed, which includes manually curated and standardized clinical, pathological, survival, and treatment metadata across both breast and ovarian cancer microarray data, and its flexible framework and unified nomenclature are presented.
Large-scale predictive modeling and analytics through regression queries in data management systems
TLDR
This work contributes a novel predictive analytics model and an associated statistical learning methodology that are efficient, scalable, and accurate in discovering piecewise linear dependencies among variables by observing only regression queries and their answers issued to a DMS.

References

Cross-study validation for the assessment of prediction algorithms
TLDR
This work develops and implements a systematic approach to ‘cross-study validation’, to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets, and suggests that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation.
Cross-study validation and combined analysis of gene expression microarray data.
TLDR
It is illustrated that it is possible to identify a substantial biologically relevant subset of the human genome within which hybridization results are reliable and to develop simple expression measures that allow comparison across platforms, studies, laboratories and populations.
On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data.
TLDR
A simple C-statistic is presented which consistently estimates a conventional concordance measure that is free of censoring, and results from numerical studies suggest that the new procedure performs well in finite samples.
Optimized application of penalized regression methods to diverse genomic data
TLDR
This work provides an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data and demonstrates the real-life application to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data.
A Bayesian Semiparametric Model for Random-Effects Meta-Analysis
In meta-analysis, there is an increasing trend to explicitly acknowledge the presence of study variability through random-effects models. That is, one assumes that for each study there is a…
Comparative meta-analysis of prognostic gene signatures for late-stage ovarian cancer.
TLDR
This work addresses outstanding controversies in the ovarian cancer literature and provides a reproducible framework for meta-analytic evaluation of gene signatures and confirms that these require improvement to be of clinical value.
Gene expression–based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study
TLDR
A large, training–testing, multi-site, blinded validation study to characterize the performance of several prognostic models based on gene expression for 442 lung adenocarcinomas, providing the largest available set of microarray data with extensive pathological and clinical annotation for lung adenocarcinomas.
Kernel Cox Regression Models for Linking Gene Expression Profiles to Censored Survival Data
TLDR
A kernel Cox regression model for relating gene expression profiles to censored phenotypes is developed in the framework of the penalization method, in terms of function estimation in reproducing kernel Hilbert spaces; results indicate that the proposed method works very well in identifying subgroups of patients with different risks of death or relapse and in predicting the risk of relapse or death.
Statistical Comparisons of Classifiers over Multiple Data Sets
  • J. Demsar
  • Computer Science
  • J. Mach. Learn. Res.
  • 2006
TLDR
A set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers is recommended: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparisons of more classifiers over multiple data sets.
Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples.
TLDR
The survival signature provides the most accurate and validated prognostic model for early- and advanced-stage high-grade, serous ovarian cancer and the debulking signature accurately predicts the outcome of cytoreductive surgery, potentially allowing for stratification of patients for primary vs secondary cytoreduction.