Fast and accurate approximate inference of transcript expression from RNA-seq data

@article{Hensman2014FastAA,
  title={Fast and accurate approximate inference of transcript expression from RNA-seq data},
  author={James Hensman and Panagiotis Papastamoulis and Peter Glaus and Antti Honkela and Magnus Rattray},
  journal={Bioinformatics},
  year={2014},
  volume={31},
  pages={3881--3889}
}
Motivation: Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared with competing methods. However, exact Bayesian inference is intractable and approximate methods such as… 
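The read-assignment problem described in the abstract can be illustrated with a small sketch. This is not the approximate Bayesian method the paper proposes; it is a plain maximum-likelihood EM for a mixture model, swapped in as a simpler technique, and it ignores transcript length and sequencing error. The function name `em_read_assignment` and the toy compatibility matrix are hypothetical, introduced only for illustration.

```python
import numpy as np

def em_read_assignment(compat, n_iter=200, tol=1e-10):
    """Illustrative EM for read-to-transcript assignment (an assumption,
    not the paper's algorithm).

    compat[i, k] is the likelihood of read i given transcript k
    (0 if the read cannot come from that transcript). Every read must be
    compatible with at least one transcript.
    """
    compat = np.asarray(compat, dtype=float)
    n_reads, n_tx = compat.shape
    theta = np.full(n_tx, 1.0 / n_tx)  # start from uniform proportions

    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each transcript
        weighted = compat * theta
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: proportions are the average responsibilities
        new_theta = resp.sum(axis=0) / n_reads
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break

    # Recompute responsibilities so they match the final proportions
    weighted = compat * theta
    resp = weighted / weighted.sum(axis=1, keepdims=True)
    return theta, resp

# Toy data: 3 reads unique to transcript 0, 1 read unique to transcript 1,
# and 2 reads shared by both (e.g. from a common exon).
compat = np.array([[1, 0]] * 3 + [[0, 1]] * 1 + [[1, 1]] * 2)
theta, resp = em_read_assignment(compat)
print(theta)    # estimated read proportions per transcript
print(resp[4])  # assignment probabilities for one ambiguous read
```

In this toy case the uniquely mapping reads determine how the shared reads are split, so the ambiguous reads are assigned mostly to the transcript with more unique support. A Bayesian treatment would instead place a prior on the proportions and report a posterior distribution rather than a point estimate.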

Bayesian estimation of differential transcript usage from RNA-seq data

The use of cjBitSeq, a previously introduced Bayesian model originally designed for identifying changes in overall expression levels, is extended to the DTU context, and a Bayesian version of DRIMSeq, a frequentist model for inferring DTU, is proposed.

Perplexity: evaluating transcript abundance estimation in the absence of ground truth

This study is the first to make model selection for transcript abundance estimation possible on experimental data in the absence of ground truth; it derives perplexity from the analogous metric used to evaluate language and topic models and extends the metric to carefully account for corner cases unique to RNA-seq.

Polee: RNA-Seq analysis using approximate likelihood

This work proposes a new method of approximating the likelihood function of a sparse mixture model, using a technique the authors call the Pólya tree transformation, and demonstrates that substituting this approximation for the real thing achieves most of the benefits with a fraction of the computational costs, leading to more accurate detection of differential transcript expression.

A Bayesian model selection approach for identifying differentially expressed transcripts from RNA sequencing data

A hierarchical Bayesian model builds on the BitSeq framework and the posterior distribution of transcript expression and differential expression is inferred by using Markov chain Monte Carlo sampling, and it is shown that the model proposed enjoys conjugacy for fixed dimension variables; thus the full conditional distributions are analytically derived.

Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability

Methods are proposed to calculate a “confidence range of expression” for each transcript, representing its possible abundance across equally optimal estimates for both quantification models; the range indicates both whether a transcript has potential estimation error due to non-identifiability and the extent of that error.

Fast and accurate quantification and differential analysis of transcriptomes

Improvements to both abundance estimation and differential expression analysis are presented, showing dramatic improvements to the speed of abundance estimation while maintaining accuracy, and a differential expression model is developed incorporating the uncertainty introduced by abundance estimation.

Detecting anomalies in RNA-seq quantification

This work develops a computational method to detect instances where a quantification model could not thoroughly explain the input, and identifies transcripts where the read coverage has significant deviations from the expectation.

Finding ranges of optimal transcript expression quantification in cases of non-identifiability

Methods are proposed to compute the range of equally optimal estimates for the expression of each transcript, accounting for non-identifiability of the quantification model through several novel graph-theoretical approaches.

Combining Multiple RNA-Seq Data Analysis Algorithms Using Machine Learning Improves Differential Isoform Expression Analysis

A novel integrative approach is developed that effectively combines the most widely used algorithms for differential transcript and isoform analysis using state-of-the-art machine learning techniques, and the study concludes that the strategy outperforms the application of the individual algorithms.

Improved data-driven likelihood factorizations for transcript abundance estimation

This work demonstrates that model simplifications adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts, and shows that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per‐fragment) likelihood, while retaining the computational efficiency of the compatibility‐based factorizations.

References

Improved variational Bayes inference for transcript expression estimation

In this paper, variational Bayesian techniques are used in order to approximate the posterior distribution of transcript expression and a novel approach is introduced which integrates the latent allocation variables out of the VB approximation.

TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference

A statistical method is proposed to estimate transcript isoform abundances from RNA-Seq data; it optimizes the number of transcript isoforms by variational Bayesian inference through an iterative procedure whose convergence is guaranteed under a stopping criterion.

Identifying differentially expressed transcripts from RNA-seq data with biological variation

A novel method for DE analysis across replicates is proposed which propagates uncertainty from the sample-level model while modelling biological variance using an expression-level-dependent prior, and the advantages of this method are demonstrated.

RNA-Seq gene expression estimation with read mapping uncertainty

Simulations with the method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed, and the method is capable of modeling non-uniform read distributions.

Statistical inferences for isoform expression in RNA-Seq

The results show that isoform expression inference in RNA-Seq is possible by employing appropriate statistical methods and statistical inferences are obtained from the posterior distribution by importance sampling.

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

It is shown that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads, and estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

The results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.

Analysis and design of RNA sequencing experiments for identifying isoform regulation

The mixture-of-isoforms (MISO) model is developed, a statistical model that estimates expression of alternatively spliced exons and isoforms and assesses confidence in these estimates, providing a probabilistic framework for RNA-seq analysis and functional insights into pre-mRNA processing.

Mapping and quantifying mammalian transcriptomes by RNA-Seq

Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3′ untranscribed regions, as well as new candidate microRNA precursors.

Quantifying alternative splicing from paired-end RNA-sequencing data

Novel data summaries and a Bayesian modeling framework are proposed that overcome limitations and determine biases in a non-parametric, highly flexible manner; they allow the study of alternative splicing patterns for individual samples and can also serve as the basis for downstream analyses.