Measuring reproducibility of high-throughput experiments

@article{Li2011MeasuringRO,
  title={Measuring reproducibility of high-throughput experiments},
  author={Qunhua Li and James B. Brown and Haiyan Huang and Peter J. Bickel},
  journal={The Annals of Applied Statistics},
  year={2011},
  volume={5},
  pages={1752--1779}
}
Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work we propose a unified approach to measure the reproducibility of findings identified from replicate experiments and identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model… 
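
The copula mixture model at the heart of this approach is compact enough to sketch numerically. The following is a minimal illustration under simplifying assumptions, not the authors' implementation: scores from two replicates are rank-transformed to approximate normal scores, a two-component bivariate Gaussian mixture whose irreproducible component is fixed at N(0, I) is fitted by EM, and the posterior probability of that null component serves as a local irreproducible discovery rate. The function name `idr_sketch`, the starting values, and the fixed normal-score margins are all illustrative; the published method estimates the copula margins semiparametrically.

```python
import numpy as np
from scipy.stats import norm, rankdata, multivariate_normal

def idr_sketch(x, y, p0=0.5, n_iter=100):
    """Toy copula-mixture fit for scores (x, y) from two replicates."""
    n = len(x)
    # Copula step: ranks -> empirical quantiles -> approximate normal scores.
    u, v = rankdata(x) / (n + 1), rankdata(y) / (n + 1)
    z = np.column_stack([norm.ppf(u), norm.ppf(v)])

    mu, sigma, rho = 1.0, 1.0, 0.5            # reproducible-component start
    for _ in range(n_iter):
        # E-step: responsibility of the irreproducible (null) component.
        cov1 = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
        f0 = multivariate_normal.pdf(z, mean=[0.0, 0.0], cov=np.eye(2))
        f1 = multivariate_normal.pdf(z, mean=[mu, mu], cov=cov1)
        idr_local = p0 * f0 / (p0 * f0 + (1 - p0) * f1)

        # M-step: update mixing weight and reproducible-component parameters.
        w = 1.0 - idr_local
        p0 = idr_local.mean()
        mu = np.sum(w[:, None] * z) / (2 * w.sum())
        d = z - mu
        sigma = np.sqrt(np.sum(w[:, None] * d**2) / (2 * w.sum()))
        rho = np.clip(np.sum(w * d[:, 0] * d[:, 1]) / (sigma**2 * w.sum()),
                      -0.99, 0.99)

    # Global IDR: expected proportion of irreproducible calls among the
    # top-k candidates, the analogue of turning local fdr into FDR.
    order = np.argsort(idr_local)
    IDR = np.cumsum(idr_local[order]) / np.arange(1, n + 1)
    return idr_local, IDR, order
```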

Citations

Maximum Rank Reproducibility: A Nonparametric Approach to Assessing Reproducibility in Replicate Experiments
TLDR
The procedure, called the maximum rank reproducibility (MaRR) procedure, uses a maximum rank statistic to parse reproducible signals from noise without making assumptions about the distribution of reproducible signals.
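
The underlying statistic is easy to state: rank candidates within each replicate so that rank 1 is the strongest signal, and take the larger of a candidate's two ranks. If the two rankings were independent (pure noise), roughly m^2/n candidates would have a max-rank of m or less, so an excess of small max-ranks signals reproducibility. Below is a sketch of just this comparison, not the MaRR estimator itself; the function name is illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def max_rank_excess(x, y, m):
    """Observed vs. noise-expected count of max-ranks <= m."""
    n = len(x)
    # Negate so that rank 1 corresponds to the strongest signal.
    r = np.maximum(rankdata(-np.asarray(x)), rankdata(-np.asarray(y)))
    observed = int(np.sum(r <= m))
    expected_null = m**2 / n   # P(max <= m) ~ (m/n)^2 under independence
    return observed, expected_null
```
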
Quantitative reproducibility analysis for identifying reproducible targets from high-throughput experiments
TLDR
This paper proposes a new method for identifying reproducible targets using a Bayesian hierarchical model and shows that the test statistics from replicate experiments follow a mixture of multivariate Gaussian distributions, with a zero-mean component representing the irreproducible targets.
Quantify and control reproducibility in high-throughput experiments.
TLDR
A set of computational methods, INTRIGUE, is proposed to evaluate and control reproducibility in high-throughput settings, built on a new definition of reproducibility that emphasizes directional consistency when experimental units are assessed with signed effect size estimates.
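
To make "directional consistency" concrete, a toy check is shown below; the hard threshold is a simplification (INTRIGUE itself is model-based), and the function name and `z_cutoff` parameter are illustrative.

```python
import numpy as np

def sign_consistent(effects, z_cutoff=2.0):
    """effects: (n_units, n_replicates) signed z-scores or effect estimates.

    A unit is directionally consistent if every replicate in which it is
    'significant' (|z| > z_cutoff) shows the same sign of effect.
    """
    effects = np.asarray(effects)
    hits = np.abs(effects) > z_cutoff
    signs = np.sign(effects)
    out = []
    for s_row, h_row in zip(signs, hits):
        s = s_row[h_row]
        out.append(s.size > 0 and (np.all(s > 0) or np.all(s < 0)))
    return np.array(out)
```
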
A statistical framework for measuring reproducibility and replicability of high-throughput experiments from multiple sources
TLDR
A novel statistical model is introduced to measure the reproducibility and replicability of findings from replicate experiments in multi-source studies using a nested copula mixture model that characterizes the interdependence between replication experiments both across and within sources.
Assessing the validity and reproducibility of genome-scale predictions
TLDR
An existing statistical model is described that is very well suited to assessing the reproducibility of validation experiments, and it is applied to a genome-scale study of adenosine deaminase acting on RNA (ADAR)-mediated RNA editing in Drosophila.
Segmented correspondence curve regression model for quantifying reproducibility of high-throughput experiments
TLDR
A novel segmented regression model, based on the rank concordance between candidates from different replicate samples, is developed to characterize the varying effects of operational factors for candidates at different levels of significance; the method yields a well-calibrated type I error.
Assessing Reproducibility of High-throughput Experiments in the Case of Missing Data
TLDR
This paper develops a regression model to assess how the reproducibility of high-throughput experiments is affected by the choices of operational factors when a large number of measurements are missing, and extends correspondence curve regression (CCR) to incorporate missing values.
A note on statistical repeatability and study design for high‐throughput assays
TLDR
This work provides guidance and software for estimating and visualizing the repeatability of high-throughput assays, and for incorporating it into study design; repeatability is a long-established statistical quantity also known as the intraclass correlation coefficient.
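
Repeatability here is the classical one-way ANOVA intraclass correlation. A minimal estimator follows, assuming a balanced design with k replicate measurements on each of n samples (the function name is illustrative; the paper's software handles richer designs).

```python
import numpy as np

def repeatability(y):
    """ICC from a (n_samples, k_replicates) array of assay measurements."""
    y = np.asarray(y, dtype=float)
    n, k = y.shape
    grand = y.mean()
    msb = k * np.sum((y.mean(axis=1) - grand) ** 2) / (n - 1)       # between
    msw = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    var_between = max((msb - msw) / k, 0.0)   # method-of-moments estimate
    return var_between / (var_between + msw)
```
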
Measuring Reproducibility of High-Throughput Deep-Sequencing Experiments Based on Self-adaptive Mixture Copula
TLDR
Experiments indicate that, compared with IDR, the SaMiC method can better estimate reproducibility between replicate samples and can self-adaptively tune its coefficients, so that the measurement of reproducibility is more effective for general distributions.
A regression framework for assessing covariate effects on the reproducibility of high‐throughput experiments
TLDR
This article proposes a regression framework, based on a novel cumulative link model, to assess the covariate effects of operational factors on the reproducibility of findings from high-throughput experiments, and shows that this method produces a calibrated type I error and is more powerful in detecting differences in reproducibility than existing measures of agreement.
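
The object these regression models build on, the correspondence curve, is simple to compute: for each fraction t of top-ranked candidates, it records the share of candidates that fall in the top t fraction of both replicates. A sketch under that reading (the function name and grid of t values are illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def correspondence_curve(x, y, ts=np.linspace(0.01, 1.0, 100)):
    """Psi(t): fraction of candidates in the top-t fraction of both lists."""
    n = len(x)
    r1 = rankdata(-np.asarray(x))   # rank 1 = strongest signal
    r2 = rankdata(-np.asarray(y))
    return ts, np.array([np.mean((r1 <= t * n) & (r2 <= t * n)) for t in ts])
```
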

References

Showing 1-10 of 58 references
Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations.
TLDR
It is concluded that designing experiments with replication will greatly reduce misclassification rates, and it is recommended that at least three replicates be used when designing experiments with cDNA microarrays, particularly when gene expression data from single specimens are being analyzed.
The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements.
TLDR
This study describes the experimental design and probe mapping efforts behind the MicroArray Quality Control project and shows intraplatform consistency across test sites as well as a high level of interplatform concordance in terms of genes identified as differentially expressed.
Local False Discovery Rates
TLDR
This paper uses local false discovery rate methods to carry out size and power calculations on large-scale data sets; an empirical Bayes approach allows the fdr analysis to proceed with a minimum of frequentist or Bayesian modeling assumptions.
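
The two-groups computation behind the local fdr fits in a few lines. The sketch below fixes the theoretical N(0, 1) null and takes pi0 as given, whereas Efron's method estimates an empirical null and pi0 from the data.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def local_fdr(z, pi0=1.0):
    """Local fdr(z) = pi0 * f0(z) / f(z) for a vector of z-values."""
    f = gaussian_kde(z)(z)   # kernel estimate of the marginal density f
    f0 = norm.pdf(z)         # theoretical null density
    return np.clip(pi0 * f0 / f, 0.0, 1.0)
```
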
A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies
TLDR
Using probe sequences matched at the exon level improved the consistency of measurements across the different microarray platforms compared to annotation-based matches; in general, consistency was good for highly expressed genes and variable for genes with lower expression values, as confirmed by quantitative real-time (QRT)-PCR.
PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls
TLDR
A general scoring approach to address unique challenges in ChIP-seq data analysis is described, based on the observation that sites of potential binding are strongly correlated with signal peaks in the control, likely revealing features of open chromatin.
Design and analysis of ChIP-seq experiments for DNA-binding proteins
TLDR
This work compares the sensitivity and spatial precision of three peak detection algorithms with published methods, and provides a method for estimating the sequencing depth necessary for a desired coverage of protein binding sites.
Mapping and quantifying mammalian transcriptomes by RNA-Seq
TLDR
Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3′ untranscribed regions, as well as new candidate microRNA precursors.
A direct approach to false discovery rates
TLDR
The calculation of the q-value, the pFDR analogue of the p-value, is discussed; the q-value eliminates the need to set the error rate beforehand, as is traditionally done, and can yield an increase of over eight times in power compared with the Benjamini–Hochberg FDR method.
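
The q-value computation itself is short. Below is a sketch using a single fixed lambda for the pi0 estimate; the published estimator smooths pi0(lambda) over a grid of lambda values.

```python
import numpy as np

def qvalues(p, lam=0.5):
    """Storey-style q-values for a vector of p-values."""
    p = np.asarray(p, dtype=float)
    m = p.size
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))   # estimated null fraction
    order = np.argsort(p)
    q = pi0 * m * p[order] / np.arange(1, m + 1)     # pFDR at each cutoff
    q = np.minimum.accumulate(q[::-1])[::-1]         # enforce monotonicity
    out = np.empty(m)
    out[order] = np.clip(q, 0.0, 1.0)
    return out
```
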
The positive false discovery rate: a Bayesian interpretation and the q-value
TLDR
This work introduces a modified version of the FDR called the “positive false discovery rate” (pFDR), which can be written as a Bayesian posterior probability and can be connected to classification theory.