• Corpus ID: 209988898

A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology.

Nima S. Hejazi, Philippe Boileau, Mark J. van der Laan, Alan E. Hubbard
arXiv: Methodology
The widespread availability of high-dimensional biological sequencing data has made the simultaneous screening of numerous biological characteristics a central statistical problem in computational biology. While the dimensionality of such data sets continues to increase, the problem of teasing out the effects of biomarkers in studies measuring baseline confounders while avoiding model misspecification remains only partially addressed. Efficient estimators constructed from data adaptive… 

A Flexible Approach for Predictive Biomarker Discovery
procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.
Comparison of microbiome samples: methods and computational challenges
Current solutions for three key challenges in the comparison of metagenomic next-generation sequencing data sets are presented, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.
Averaging causal estimators in high dimensions
It is shown theoretically that averaging provides robustness against choosing a bad model, and empirically via simulation that the averaging estimator performs quite well, and in most cases nearly as well as the best among all possible candidate estimators.


Targeted Learning: Causal Inference for Observational and Experimental Data
This work focuses on TMLE in adaptive group sequential covariate-adjusted RCTs, which involves cross-validated targeted minimum loss-based estimation and targeted Bayesian learning.
Causality: Models, Reasoning and Inference
1. Introduction to probabilities, graphs, and causal models 2. A theory of inferred causation 3. Causal diagrams and the identification of causal effects 4. Actions, plans, and direct effects 5.
Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments
G. Smyth, Statistical Applications in Genetics and Molecular Biology, 2004
The hierarchical model of Lonnstedt and Speed (2002) is developed into a practical approach for general microarray experiments with arbitrary numbers of treatments and RNA samples and the moderated t-statistic is shown to follow a t-distribution with augmented degrees of freedom.
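As a rough illustration of the moderated t-statistic mentioned in this entry, the sketch below shrinks a per-gene sample variance toward a prior variance and reports the augmented degrees of freedom. The function name `moderated_t` and its arguments are hypothetical, and the prior parameters `d0` and `s0_sq` are assumed to be given here; in practice they are estimated from the full set of genes by empirical Bayes, a step this sketch does not attempt.

```python
import math


def moderated_t(beta_hat, s2, v, d, d0, s0_sq):
    """Moderated t-statistic for a single gene (illustrative sketch).

    beta_hat : estimated coefficient (e.g. a log fold change)
    s2       : the gene's own residual sample variance
    v        : unscaled variance factor, so that Var(beta_hat) = v * sigma^2
    d        : residual degrees of freedom for this gene
    d0       : prior degrees of freedom (ASSUMED known in this sketch)
    s0_sq    : prior variance (ASSUMED known in this sketch)
    """
    # Posterior variance: a degrees-of-freedom-weighted average of the
    # prior variance and the gene-specific sample variance.
    s2_tilde = (d0 * s0_sq + d * s2) / (d0 + d)
    t_mod = beta_hat / math.sqrt(s2_tilde * v)
    # Under the null, the statistic is referred to a t-distribution with
    # d0 + d degrees of freedom (the "augmented" degrees of freedom).
    return t_mod, d0 + d


# A gene with an unusually small sample variance: the ordinary t-statistic
# is large, while the moderated version is pulled toward a more typical scale.
t_ordinary = 1.0 / math.sqrt(0.04 * 0.5)
t_mod, df = moderated_t(beta_hat=1.0, s2=0.04, v=0.5, d=4, d0=4, s0_sq=0.36)
print(round(t_ordinary, 3), round(t_mod, 3), df)
```

The shrinkage guards against genes whose variance is underestimated by chance in small samples, which would otherwise produce inflated test statistics.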
limma: Linear Models for Microarray Data
This chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments with technical as well as biological replication.
Biomarker discovery using targeted maximum‐likelihood estimation: Application to the treatment of antiretroviral‐resistant HIV infection
A new approach to research questions of this type, based on targeted maximum‐likelihood estimation of variable importance measures is introduced, which aims to learn which of a set of candidate biomarkers is important in determining a given outcome.
biotmle: Targeted Learning for Biomarker Discovery
The biotmle package provides an implementation of a biomarker discovery methodology based on targeted minimum loss-based estimation (TMLE) and a generalization of the moderated t-statistic of Smyth (2004), designed for use with biological sequencing data.
Assessing exposure effects on gene expression
The regression, IPW, and g-formula approaches to exposure effect estimation are compared herein using simulations; advantages and disadvantages of each approach are explored.
Big Data, Small Sample: Edgeworth Expansions Provide a Cautionary Tale
Multiple comparisons and small sample size, common characteristics of many types of “Big Data” including those that are produced by genomic studies, present specific challenges that affect
A Simple Sequentially Rejective Multiple Test Procedure
This paper presents a simple and widely applicable multiple test procedure of the sequentially rejective type, i.e. hypotheses are rejected one at a time until no further rejections can be done. It
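The sequentially rejective idea described in this entry, commonly known as the Holm procedure, can be sketched in a few lines. The function name `holm_reject` and the example p-values are illustrative assumptions, not taken from the paper.

```python
def holm_reject(p_values, alpha=0.05):
    """Step-down (Holm) multiple testing sketch.

    Returns a list of booleans, in the original order of p_values,
    marking which hypotheses are rejected at family-wise level alpha.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared
        # against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Stop at the first non-rejection; all remaining
            # hypotheses are retained.
            break
    return reject


decisions = holm_reject([0.01, 0.04, 0.03, 0.005], alpha=0.05)
print(decisions)
```

With four hypotheses the successive thresholds are 0.0125, 0.0167, 0.025, and 0.05, so testing stops at the third-smallest p-value (0.03 > 0.025) and only the two smallest are rejected.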
Oracle inequalities for multi-fold cross validation
The results are extended to penalized cross validation in order to control unbounded loss functions and applications include regression with squared and absolute deviation loss and classification under Tsybakov’s condition.