• Corpus ID: 54469647

Bridging the Generalization Gap: Training Robust Models on Confounded Biological Data

Tzu-Yu Liu, Ajay Kannan, Adam Drake, Marvin Bertin, Nathan Wan

Statistical learning on biological data can be challenging due to confounding variables in sample collection and processing. Confounders can cause models to generalize poorly and result in inaccurate prediction performance metrics if models are not validated thoroughly. In this paper, we propose methods to control for confounding factors and further improve prediction performance. We introduce OrthoNormal basis construction In cOnfounding factor Normalization (ONION) to remove confounding… 

A Penalty Approach for Normalizing Feature Distributions to Build Confounder-Free Models

Improvement in model accuracy and independence from the confounders is shown using PMDN over MDN in a synthetic experiment and a multi-label, multi-site classification of magnetic resonance images.

Adversarially-regularized mixed effects deep learning (ARMED) models for improved interpretability, performance, and generalization on clustered data

  • K. Nguyen, A. Montillo
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2023
ARMED models better distinguish confounded from true associations in synthetic data, emphasize more biologically plausible features in clinical applications, and improve both accuracy on data from clusters seen during training and generalization to unseen clusters.

Normalizing RNA-Sequencing Data by Modeling Hidden Covariates with Prior Knowledge

A unified residual framework that encapsulates existing approaches is described, and within this framework a novel method is presented, HCP (Hidden Covariates with Prior), which performs as well as or better than existing approaches at a much lower computational cost.

Realistic in silico generation and augmentation of single cell RNA-seq data using Generative Adversarial Neural Networks

  • M. Marouf, Pierre Machart, S. Bonn
  • Biology, Computer Science
  • 2018
Conditional single cell Generative Adversarial Neural Networks (cscGANs) outperform existing methods for single cell RNA-seq data generation in quality and hold great promise for the realistic generation and augmentation of other biomedical data types.

mixup: Beyond Empirical Risk Minimization

This work proposes mixup, a simple learning principle that trains a neural network on convex combinations of pairs of examples and their labels, which improves the generalization of state-of-the-art neural network architectures.
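
The convex-combination step described in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mixup_pair`, the plain-list representation of examples, and the default `alpha=0.2` are assumptions made for illustration. The mixing weight is drawn from a Beta(alpha, alpha) distribution, as in the mixup formulation.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels (hypothetical helper).

    x_i, x_j: feature vectors as equal-length lists of floats.
    y_i, y_j: one-hot (or soft) label vectors as equal-length lists.
    alpha:    Beta distribution parameter; 0.2 is an assumed default.
    rng:      the random module or a random.Random instance.
    """
    # Mixing weight lam in [0, 1], sampled from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Apply the same convex combination to the inputs and to the labels.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


# Usage: mix one example of class 0 with one example of class 1.
rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair([1.0, 0.0], [1.0, 0.0],
                               [0.0, 1.0], [0.0, 1.0], rng=rng)
```

Because the same weight is applied to features and labels, the mixed label remains a valid probability vector whenever both input labels are.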

Adjusting batch effects in microarray expression data using empirical Bayes methods.

Non-biological experimental variation or "batch effects" are commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches

Domain-Adversarial Training of Neural Networks

A new representation learning approach is proposed for domain adaptation, in which data at training and test time come from similar but different distributions; it can be implemented in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer.

Generative Adversarial Nets

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a

Tackling the widespread and critical impact of batch effects in high-throughput data

It is argued that batch effects (as well as other technical and biological artefacts) are widespread and critical to address, and experimental and computational approaches for doing so are reviewed.

Bayesian Canonical correlation analysis

This work introduces a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies between data sets but also decomposes the data into shared and data set-specific components.

Machine learning applications in genetics and genomics

An overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data is provided.

Strategies for discovering novel cancer biomarkers through utilization of emerging technologies

Despite the fact that new technologies and strategies often fail to identify well-established cancer biomarkers and show a bias toward the identification of high-abundance molecules, these technological advances have the capacity to revolutionize biomarker discovery.