Knowledge-based data analysis comes of age

  • Michael F. Ochs
  • Published 2010
  • Computer Science
  • Briefings in bioinformatics, volume 11, issue 1
The emergence of high-throughput technologies for measuring biological systems has introduced problems for data interpretation that must be addressed for proper inference. First, analysis techniques need to be matched to the biological system, reflecting in their mathematical structure the underlying behavior being studied. When this is not done, mathematical techniques will generate answers, but the values and reliability estimates may not accurately reflect the biology. Second, analysis… 

Inferential stability in systems biology

A novel algorithm is presented for proposing putative biomarkers on the strength of both their predictive ability and the stability with which they are selected, and the importance of finding distributions of ODE parameter estimates, rather than single point estimates, is highlighted.

Knowledge-guided differential dependency network learning for detecting structural changes in biological networks

This work formulates the inference of condition-specific network structures that incorporate relevant prior knowledge as a convex optimization problem, and develops an efficient learning algorithm that fully exploits the benefit of prior knowledge while remaining robust to false positive edges in that knowledge.

OnionTree XML: A Format to Exchange Gene-Related Probabilities

The goal is an XML format that encodes relationships as probabilities of interactions for the purposes of genetics and bioinformatics, with the interpretation of the message in the XML depending on the context of the parser.

Leveraging external knowledge on molecular interactions in classification methods for risk prediction of patients

An overview is given of current methods to quantify and incorporate biological prior knowledge of molecular interactions and known cellular processes into the feature selection process, as well as of the databases from which this external knowledge can be obtained.

Knowledge-fused differential dependency network models for detecting significant rewiring in biological networks

This work formulated the inference of differential dependency networks that incorporate both conditional data and prior knowledge as a convex optimization problem, and developed an efficient learning algorithm to jointly infer the conserved biological network and the significant rewiring across different conditions.

Knowledge-Based Compact Disease Models: A Rapid Path from High-Throughput Data to Understanding Causative Mechanisms for a Complex Disease.

The Compact Disease Model (CDM), composed of the gene list distilled by this analytic technique and its network-based representation, allowed us to highlight a possible role of protein traffic vesicles in the pathogenesis of Alzheimer's.

Quantitative knowledge-based analysis in compound safety assessment

The authors show several examples of quantitative functional analysis, including cross-tissue toxicity predictions and integrated analysis of different types of OMICs data, illustrating potential advantages of knowledge-based approaches in prediction of human toxicity.

ConReg-R: Extrapolative recalibration of the empirical distribution of p-values to improve false discovery rate estimates

A new extrapolative method called Constrained Regression Recalibration (ConReg-R) is proposed to recalibrate the empirical p-values by modeling their distribution to improve the FDR estimates.
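For orientation, the kind of FDR estimate that such a recalibration feeds into can be sketched with the standard Benjamini-Hochberg adjustment. This is not ConReg-R itself, only the conventional step that turns a list of p-values into FDR-adjusted values; the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Illustrative sketch only: this is the standard BH step-up adjustment,
    not the ConReg-R recalibration described in the summary above.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, chosen only to exercise the function.
example = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(example)
```

If the empirical p-values are miscalibrated, the adjusted values computed this way inherit the error, which is the motivation for recalibrating the p-value distribution first.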

Knowledge-based compact disease models identify new molecular players contributing to early-stage Alzheimer’s disease

A flexible approach for high-throughput data analysis, the Compact Disease Model generation, allows extraction of meaningful, mechanism-centered gene sets compatible with instant translation of the results into testable hypotheses.

Network biology and machine learning approaches to metastasis and treatment response

Inference and analysis of small-scale networks from human tumour tissue samples, scored for protein expression, provides insight into pleiotropy, complex interactions and context-specific behaviour.



Subsystem identification through dimensionality reduction of large-scale gene expression data.

Functional relationships predicted by the new analysis are compared with those predicted using standard approaches; validation using bioinformatic databases suggests predictions using the new approach may be up to twice as accurate as some conventional approaches.

Using Bayesian networks to analyze expression data

This paper proposes a new framework for discovering interactions between genes based on multiple expression measurements, and presents an efficient algorithm capable of learning such networks and a statistical method to assess confidence in their features.

Application of Bayesian Decomposition for analysing microarray data

The ability of the algorithm to provide insight into the yeast cell cycle is demonstrated, including identification of five temporal patterns tied to cell cycle phases as well as the identification of a pattern tied to an approximately 40 min cell cycle oscillator.

Seeded Bayesian Networks: Constructing genetic networks from microarray data

The use of network seeds greatly improves the ability of Bayesian Network analysis to learn gene interaction networks from gene expression data, allowing networks involving dynamic processes to be deduced from the static snapshots of biological systems that represent the most common source of microarray data.

Determination of strongly overlapping signaling activity from microarray data

This work demonstrates that microarray data can provide downstream indicators of pathway activity, through use of either gene ontology or transcription factor databases, and can be used to investigate the specificity and success of targeted therapeutics as well as to elucidate signaling activity in normal and disease processes.

Identifying functional modules using expression profiles and confidence-scored protein interactions

A probabilistic model is proposed, together with a weighting scheme in which the likelihood of the connectivity of a subnetwork is related to the weight of its minimum cut; the results show that CEZANNE outperforms previous methods for analysis of expression and interaction data.

Determining Transcription Factor Activity from Microarray Data using Bayesian Markov Chain Monte Carlo Sampling

A novel approach is presented that handles both the assignment of genes to multiple patterns, as required by multiple regulation, and the linking of genes in prior probability distributions according to their known transcriptional regulators.

Metagenes and molecular pattern discovery using matrix factorization

Nonnegative matrix factorization, an algorithm based on decomposition by parts, is described; it can reduce the dimension of expression data from thousands of genes to a handful of metagenes, and is found to be less sensitive to a priori selection of genes or initial conditions and able to detect alternative or context-dependent patterns of gene expression in complex biological systems.

Gene expression trends and protein features effectively complement each other in gene function prediction

The use of Rough Sets for a novel data integration strategy, in which gene expression data, protein features and Gene Ontology annotations were combined to describe general and biologically relevant patterns represented by If-Then rules, shows that the approach can be used to build very robust models that create synergy from integrating gene expression data and protein features.

Integrating shotgun proteomics and mRNA expression data to improve protein identification

A Bayesian score is developed that estimates the posterior probability of a protein's presence in the sample given its identification in an MS/MS experiment and its mRNA concentration measured under similar experimental conditions, demonstrating that incorporating prior knowledge of protein presence into shotgun proteomics experiments can substantially improve protein identification scores.
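The idea of a posterior probability of presence can be illustrated with plain Bayes' rule. The sketch below is not the score from that paper; it assumes, purely for illustration, that the mRNA measurement has already been converted into a prior probability of presence and that the two identification likelihoods are known. All names and numbers are hypothetical.

```python
def posterior_presence(prior, p_id_given_present, p_id_given_absent):
    """Posterior probability that a protein is present, given that it was identified.

    Illustrative Bayes' rule sketch with hypothetical inputs:
      prior              -- prior probability of presence (assumed here to
                            come from the mRNA concentration)
      p_id_given_present -- probability of an identification if present
      p_id_given_absent  -- probability of a (false) identification if absent
    """
    evidence = p_id_given_present * prior + p_id_given_absent * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("identification has zero probability under both hypotheses")
    return p_id_given_present * prior / evidence


# Hypothetical numbers: the same identification evidence, two different priors.
low_mrna = posterior_presence(0.2, 0.9, 0.1)
high_mrna = posterior_presence(0.5, 0.9, 0.1)
```

With identical identification evidence, the higher prior yields the higher posterior, which is the sense in which prior knowledge of presence can shift an identification score.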