Iterative random forests to discover predictive and stable high-order interactions
@article{Basu2018IterativeRF,
  title={Iterative random forests to discover predictive and stable high-order interactions},
  author={Sumanta Basu and Karl Kumbier and James B. Brown and Bin Yu},
  journal={Proceedings of the National Academy of Sciences of the United States of America},
  year={2018},
  volume={115},
  pages={1943--1948}
}
Significance: We developed a predictive, stable, and interpretable tool: the iterative random forest algorithm (iRF). iRF discovers high-order interactions among biomolecules with the same order of computational cost as random forests. We demonstrate the efficacy of iRF by finding known and promising interactions among biomolecules, of up to fifth and sixth order, in two data examples in transcriptional regulation and alternative splicing. Genomics has revolutionized biology, enabling the…
136 Citations
JigSaw: A tool for discovering explanatory high-order interactions from random forests
- Computer Science
- ArXiv
- 2020
JigSaw is an efficient method for exploring high-dimensional feature spaces for rules that explain statistical associations with a given outcome and can inspire the generation of testable hypotheses.
A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
- Computer Science
- Genes
- 2019
This paper presents a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF), which enables explainable-AI eQTL analysis of SNP sets with over a million SNPs, and introduces a new method, iRF Leave One Out Prediction (iRF-LOOP), for creating Predictive Expression Networks on the order of 40,000 genes or more.
Learning epistatic polygenic phenotypes with Boolean interactions
- Computer Science
- bioRxiv
- 2020
The epiTree pipeline, which extracts higher-order interactions from genomic data using tree-based models, is introduced, and individual Boolean or tree-based epistasis models are found to generally provide higher prediction accuracy than classical logistic regression.
A High Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
- Computer Science
- 2019
This paper presents a high performance computing (HPC)-capable implementation of Iterative Random Forest (iRF), which enables explainable-AI eQTL analysis of SNP sets with over a million SNPs, and introduces a new method, iRF Leave One Out Prediction (iRF-LOOP), for creating Predictive Expression Networks on the order of 40,000 genes or more.
BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes
- Computer Science, Environmental Science
- Frontiers in Molecular Biosciences
- 2021
A suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables frequently used for classification and provides a new way of using decision trees to generate testable biological hypotheses.
BowSaw: inferring higher-order trait interactions associated with complex biological phenotypes
- Computer Science, Environmental Science
- bioRxiv
- 2019
A suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables frequently used for classification, and provides a new way of using decision trees to generate testable biological hypotheses.
Uncovering Effective Explanations for Interactive Genomic Data Analysis
- Computer Science
- Patterns
- 2020
Identifying genetic determinants of complex phenotypes from whole genome sequence data
- Biology
- BMC Genomics
- 2019
It is demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals.
Identifying genetic determinants of complex phenotypes from whole genome sequence data and very small training sets
- Biology
- bioRxiv
- 2018
It is demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even with small numbers of individuals, and that chunking improved runtimes by an order of magnitude and may increase sensitivity of the predictions.
Pathway-Based Single-Cell RNA-Seq Classification, Clustering, and Construction of Gene-Gene Interactions Networks Using Random Forests
- Biology
- IEEE Journal of Biomedical and Health Informatics
- 2020
This paper proposes a pathway-based analytic framework that uses Random Forests to identify discriminative functional pathways related to cellular heterogeneity and to cluster cell populations in scRNA-Seq data, along with a novel RF-based method for constructing gene-gene interaction (GGI) networks that highlights the GGIs important for differentiating cell populations.
References
A balanced iterative random forest for gene selection from microarray data
- Computer Science, Biology
- BMC Bioinformatics
- 2013
The designed BIRF algorithm is an appropriate choice for selecting genes from imbalanced high-throughput gene expression microarray data and outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data.
Global Quantitative Modeling of Chromatin Factor Interactions
- Biology, Computer Science
- PLoS Comput. Biol.
- 2014
A global modeling framework that leverages chromatin profiling data to produce a systems-level view of the macromolecular complex of chromatin, provides a highly accurate predictor of pairwise chromatin factor interactions validated by known experimental evidence, and for the first time enables higher-order interaction prediction.
Sparse kernel canonical correlation analysis for discovery of nonlinear interactions in high-dimensional data
- Computer Science
- BMC Bioinformatics
- 2017
It is shown that TSKCCA can extract multiple, nonlinear associations among high-dimensional data and multiplicative interactions among variables more reliably than previous nonlinear CCA methods.
eFORGE: A Tool for Identifying Cell Type-Specific Signal in Epigenomic Data
- Biology
- Cell reports
- 2016
Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer.
- Biology
- American journal of human genetics
- 2001
One of the greatest challenges facing human geneticists is the identification and characterization of susceptibility genes for common complex multifactorial human diseases. This challenge is partly…
Modeling gene expression using chromatin features in various cellular contexts
- Biology
- Genome Biology
- 2012
This study builds a novel quantitative model and finds that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy, and that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq.
Extensive Promoter-Centered Chromatin Interactions Provide a Topological Basis for Transcription Regulation
- Biology
- Cell
- 2012
Enriched random forests
- Computer Science
- Bioinform.
- 2008
This work proposes a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features.
Integrative annotation of chromatin elements from ENCODE data
- Computer Science, Biology
- Nucleic acids research
- 2013
These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types.
Bayesian inference of epistatic interactions in case-control studies
- Biology
- Nature Genetics
- 2007
It is demonstrated that the proposed 'bayesian epistasis association mapping' method is significantly more powerful than existing approaches and that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible.