Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming

@article{Sohn2017TowardTA,
  title={Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming},
  author={Andrew Sohn and Randal S. Olson and Jason H. Moore},
  journal={Proceedings of the Genetic and Evolutionary Computation Conference},
  year={2017}
}
Machine learning has been gaining traction in recent years to meet the demand for tools that can efficiently analyze and make sense of the ever-growing databases of biomedical data in health care systems around the world. However, effectively using machine learning methods requires considerable domain expertise, which can be a barrier to entry for bioinformaticians new to computational data science methods. Therefore, off-the-shelf tools that make machine learning more accessible can prove…

The promise of automated machine learning for the genetic analysis of complex traits

The promise of AutoML is to enable anyone, regardless of training or expertise, to apply machine learning as part of their genetic analysis strategy and it is hoped that this review will motivate studies to develop and evaluate novel AutoML methods and software in the genetics and genomics space.

Artificial Intelligence Based Approaches to Identify Molecular Determinants of Exceptional Health and Life Span-An Interdisciplinary Workshop at the National Institute on Aging

The workshop involved experts in the fields of aging, comparative biology, cardiology, cancer, and computational science/AI who brainstormed ideas on how AI can be leveraged for the analyses of large-scale data sets from human epidemiological studies and animal/model organisms to close the current knowledge gaps in processes that drive exceptional life and health span.

How Computational Experiments Can Improve Our Understanding of the Genetic Architecture of Common Human Diseases

A heuristic simulation-based method for conducting experiments about the complexity of genetic architecture is developed, and it is shown that a genetic architecture driven by complex interactions is highly consistent with the magnitude and distribution of univariate effects seen in real data.

Automatic Tuning of Rule-Based Evolutionary Machine Learning via Problem Structure Identification

This work presents a parameter setting mechanism for a rule-based evolutionary machine learning system that is capable of finding the adequate parameter value for a wide variety of synthetic classification problems with binary attributes and with/without added noise.

KLFDAPC: a supervised machine learning approach for spatial genetic structure analysis

KLFDAPC can be useful for geographic ancestry inference, design of genome scans and correction for spatial stratification in GWAS that link genes to adaptation or disease susceptibility.

ARABIC NAMED ENTITY RECOGNITION BASED ON TREE-BASED PIPELINE OPTIMIZATION TOOL

This method uses tree-based genetic programming to find the model and hyperparameters that most closely predict the class of Arabic named entities in text from social media.

Comparision of diabetic prediction AutoML model with customized model

AutoML models built with the TPOT and H2O AutoML platforms for non-invasive diabetes prediction on the Pima Indian Diabetes Dataset are compared with the performance of a manually created model.

Survey of Metaheuristics and Statistical Methods for Multifactorial Diseases Analyses

A survey of metaheuristics and statistical methods used in human genetics, specifically for multifactorial diseases, to help geneticists find interactions between the genes and environmental factors involved in those diseases.

A System for Accessible Artificial Intelligence

An ongoing project is presented whose ultimate goal is to deliver an open source, user-friendly AI system that is specialized for machine learning analysis of complex data in the biomedical and health care domains.

An Extensive Experimental Evaluation of Automated Machine Learning Methods for Recommending Classification Algorithms (Extended Version)

The EA evolving decision-tree induction algorithms has the advantage of producing algorithms that generate interpretable classification models and that are more scalable to large datasets, by comparison with many algorithms from other learning paradigms that can be recommended by Auto-WEKA.

References

Automating Biomedical Data Science Through Tree-Based Pipeline Optimization

This work implements a Tree-based Pipeline Optimization Tool (TPOT) and shows that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators—such as synthetic feature constructors—that significantly improve classification accuracy on these data sets.

Exploiting Expert Knowledge in Genetic Programming for Genome-Wide Genetic Analysis

This study demonstrates that GP may be a useful computational discovery tool in this domain and shows that using expert knowledge to select trees performs as well as a multiobjective fitness function but requires only a tenth of the population size.

Genetic Analysis of Prostate Cancer Using Computational Evolution, Pareto-Optimization and Post-processing

The goal of the present study is to make extensions and enhancements to a computational evolution system (CES) that has the ultimate objective of tinkering with data as a human would by introducing the use of Pareto-optimization to help address overfitting in the learning system.

TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning

This chapter presents TPOT v0.3, an open source genetic programming-based AutoML system that optimizes a series of feature preprocessors and machine learning models with the goal of maximizing classification accuracy on a supervised classification task.

Bioinformatics challenges for genome-wide association studies

It is argued here that bioinformatics has an important role to play in addressing the complexity of the underlying genetic basis of common human diseases as well as those GWAS challenges that will require computational methods.

Identifying and Harnessing the Building Blocks of Machine Learning Pipelines for Sensible Initialization of a Data Science Automation Tool

This chapter presents a genetic programming-based AutoML system called TPOT that optimizes a series of feature preprocessors and machine learning models with the goal of maximizing classification accuracy on a supervised classification problem.

Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science

This paper implements an open source Tree-based Pipeline Optimization Tool (TPOT) in Python and shows that TPOT can design machine learning pipelines that provide a significant improvement over a basic machine learning analysis while requiring little to no input nor prior knowledge from the user.

Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome: a learning classifier system approach

The hypothesis that an LCS approach can offer greater insight into complex patterns of association can be supported, as this methodology appears to be well suited to the dissection of disease heterogeneity, a key component in the advancement of personalized medicine.

Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions

A multifactor dimensionality reduction (MDR) method is developed for collapsing high-dimensional genetic data into a single dimension, thus permitting interactions to be detected in relatively small sample sizes.