Iterative random forests to discover predictive and stable high-order interactions

@article{Basu2018IterativeRF,
  title={Iterative random forests to discover predictive and stable high-order interactions},
  author={Sumanta Basu and Karl Kumbier and James B. Brown and Bin Yu},
  journal={Proceedings of the National Academy of Sciences of the United States of America},
  year={2018},
  volume={115},
  pages={1943--1948}
}
  • Sumanta Basu, Karl Kumbier, James B. Brown, Bin Yu
  • Published 26 June 2017
  • Biology
  • Proceedings of the National Academy of Sciences of the United States of America
Significance: We developed a predictive, stable, and interpretable tool: the iterative random forest algorithm (iRF). iRF discovers high-order interactions among biomolecules with the same order of computational cost as random forests. We demonstrate the efficacy of iRF by finding known and promising interactions among biomolecules, of up to fifth and sixth order, in two data examples in transcriptional regulation and alternative splicing. Genomics has revolutionized biology, enabling the…

JigSaw: A tool for discovering explanatory high-order interactions from random forests
TLDR
JigSaw is an efficient method for exploring high-dimensional feature spaces for rules that explain statistical associations with a given outcome and can inspire the generation of testable hypotheses.
A High-Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
TLDR
This paper presents a high-performance computing (HPC)-capable implementation of Iterative Random Forest (iRF), which enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs and presents a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more.
Learning epistatic polygenic phenotypes with Boolean interactions
TLDR
The epiTree pipeline to extract higher-order interactions from genomic data using tree-based models is introduced, and it is found that individual Boolean or tree-based epistasis models generally provide higher prediction accuracy than classical logistic regression.
A High Performance Computing Implementation of Iterative Random Forest for the Creation of Predictive Expression Networks
TLDR
This paper presents a high performance computing (HPC)-capable implementation of Iterative Random Forest (iRF), which enables the explainable-AI eQTL analysis of SNP sets with over a million SNPs and presents a new method, iRF Leave One Out Prediction (iRF-LOOP), for the creation of Predictive Expression Networks on the order of 40,000 genes or more.
BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes
TLDR
A suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables frequently used for classification and provides a new way of using decision trees to generate testable biological hypotheses.
BowSaw: inferring higher-order trait interactions associated with complex biological phenotypes
TLDR
A suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables frequently used for classification, and provides a new way of using decision trees to generate testable biological hypotheses.
Identifying genetic determinants of complex phenotypes from whole genome sequence data
TLDR
It is demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even when trained on small numbers of individuals.
Identifying genetic determinants of complex phenotypes from whole genome sequence data and very small training sets
TLDR
It is demonstrated that ML algorithms can be used to identify genetic determinants in small proteomes (viruses), even with small numbers of individuals, and that chunking improved runtimes by an order of magnitude and may increase sensitivity of the predictions.
Pathway-Based Single-Cell RNA-Seq Classification, Clustering, and Construction of Gene-Gene Interactions Networks Using Random Forests
TLDR
A pathway-based analytic framework using Random Forests is proposed to identify discriminative functional pathways related to cellular heterogeneity and to cluster cell populations in scRNA-Seq data, along with a novel RF-based method for constructing gene-gene interaction (GGI) networks that illustrates the GGIs important in differentiating cell populations.

References

A balanced iterative random forest for gene selection from microarray data
TLDR
The designed BIRF algorithm is an appropriate choice for selecting genes from imbalanced high-throughput gene expression microarray data and outperforms state-of-the-art methods, especially in its ability to handle class-imbalanced data.
Global Quantitative Modeling of Chromatin Factor Interactions
TLDR
A global modeling framework that leverages chromatin profiling data to produce a systems-level view of the macromolecular complex of chromatin; it provides a highly accurate predictor of chromatin factor pairwise interactions, validated by known experimental evidence, and for the first time enables higher-order interaction prediction.
Sparse kernel canonical correlation analysis for discovery of nonlinear interactions in high-dimensional data
TLDR
It is shown that TSKCCA can extract multiple, nonlinear associations among high-dimensional data and multiplicative interactions among variables more reliably than previous nonlinear CCA methods.
Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer.
One of the greatest challenges facing human geneticists is the identification and characterization of susceptibility genes for common complex multifactorial human diseases. This challenge is partly
Modeling gene expression using chromatin features in various cellular contexts
TLDR
This study builds a novel quantitative model and finds that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy, and that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq.
Enriched random forests
TLDR
This work proposes a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features.
Integrative annotation of chromatin elements from ENCODE data
TLDR
These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types.
Bayesian inference of epistatic interactions in case-control studies
TLDR
It is demonstrated that the proposed 'Bayesian epistasis association mapping' method is significantly more powerful than existing approaches and that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible.