Machine Learning in Genomic Medicine: A Review of Computational Problems and Data Sets

Authors: Michael K. K. Leung, Andrew Delong, Babak Alipanahi, Brendan J. Frey
Journal: Proceedings of the IEEE

In this paper, we provide an introduction to machine learning tasks that address important problems in genomic medicine. One of the goals of genomic medicine is to determine how variations in the DNA of individuals can affect the risk of different diseases, and to find causal explanations so that targeted therapies can be designed. Here we focus on how machine learning can help to model the relationship between DNA and the quantities of key molecules in the cell, with the premise that these… 

Inferring phenotypes from genotypes with machine learning : an application to the global problem of antibiotic resistance

This thesis applies machine learning to the prediction of antibiotic resistance, a global public health problem of high significance, and demonstrates that algorithms can accurately predict resistance phenotypes and help improve the understanding of them.

Machine learning and clinical epigenetics: a review of challenges for diagnosis and classification

An ever-growing number of epigenetic alterations in disease have now been reported, which offers a chance to increase the sensitivity and specificity of future diagnostics and therapies as machine learning methods are on the rise.

A Survey on Classification Analysis for Cancer Genomics: Limitations and Novel Opportunity in the Era of Cancer Classification and Target Therapies

This work reviews the most up-to-date knowledge of cancer classification models and targeted therapy, describes how genetic mutations affect responsiveness to targeted therapy, and highlights the related issues in this era of precision medicine.

Diet Networks: Thin Parameters for Fat Genomics

A novel neural network parametrization is proposed which considerably reduces the number of parameters and the error rate of the classifier on tasks in which the input is a description of the genetic variation specific to a patient, the single nucleotide polymorphisms, yielding millions of ternary inputs.

Machine and deep learning meet genome-scale metabolic modeling

This review describes how machine learning and constraint-based modeling can be combined, surveying recent works at the intersection of both domains and discussing the mathematical and practical aspects involved, as well as overlapping systematic classifications from both frameworks.

Interpreting regulatory variants with predictive models

This thesis developed MMSplice, a modular deep learning framework to predict the effect of genetic variants on splicing in human cells, which outperformed state-of-the-art models and was the winning model of the 5th Critical Assessment of Genome Interpretation (CAGI) exon-skipping competition.

Deep Learning in Pharmacogenomics: From Gene Regulation to Patient Stratification

This Perspective provides examples of current and future applications of deep learning in pharmacogenomics, including: identification of novel regulatory variants located in noncoding domains of the

Machine Learning Techniques for Analysis of Human Genome Data

This paper reviews current methods and trends in machine learning and data mining approaches, which are complex and challenging both to model and to evaluate for performance.



Extracting sequence features to predict protein–DNA interactions: a comparative study

It is found that, with proper learning methods, predictive modeling approaches can significantly improve the predictive power and identify more biologically interesting features, such as TF–TF interactions, than the PWM approach, and BART and boosting show the best and the most robust overall performance among all the methods.

Deep learning of the tissue-regulated splicing code

The deep architecture surpasses the performance of the previous Bayesian method for predicting alternative splicing (AS) patterns and demonstrates that deep architectures can be beneficial, even with a moderately sparse dataset.

Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

This work shows that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery.

Methods of integrating data to uncover genotype–phenotype interactions

The emerging approaches for data integration — including meta-dimensional and multi-staged analyses — which aim to deepen the understanding of the role of genetics and genomics in complex outcomes are explored.

TCPA: a resource for cancer functional proteomics data

This work has generated the largest publicly available collection of cancer functional proteomics data with parallel DNA and RNA data over a large number of tumor and cell line samples using reverse-phase protein arrays (RPPAs).

Machine learning for science and society

In the era of “big data,” there is a need for machine learning to address important large-scale applied problems, yet it is difficult to find top venues in machine learning where such work is encouraged.

A method and server for predicting damaging missense mutations

A new method and corresponding software tool, PolyPhen-2, is presented. It differs from the earlier tool PolyPhen1 in its set of predictive features, alignment pipeline, and method of classification, and its performance, as shown by its receiver operating characteristic curves, was consistently superior.

Risk estimation and risk prediction using machine-learning methods

Methods for the construction and evaluation of classification and probability estimation rules and their application to a genome-wide association analysis on rheumatoid arthritis are described.

Five years of GWAS discovery.

Emerging patterns of somatic mutations in cancer

The developing statistical approaches that are used to identify significantly mutated genes are highlighted, and the emerging biological and clinical insights from such analyses are discussed, as well as the future challenges of translating these genomic data into clinical impacts.