Near-optimal probabilistic RNA-seq quantification

Nicolas L. Bray, Harold Pimentel, Páll Melsted, Lior Pachter. Nature Biotechnology.
We present kallisto, an RNA-seq quantification program that is two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases. We use kallisto to analyze 30 million unaligned paired-end RNA-seq reads in <10 min on a standard laptop computer. This removes a major computational bottleneck in RNA-seq analysis. 
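
The idea of listing the transcripts compatible with a read without aligning individual bases can be illustrated with a small sketch. This is not kallisto's implementation. It assumes a plain hash index from k-mers to transcript names, ignores reverse complements and read pairing, and treats a k-mer missing from the index as compatible with nothing. The names `build_kmer_index` and `pseudoalign` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts that contain every k-mer of the read.

    Assumption of this sketch: a k-mer absent from the index makes the
    read compatible with no transcript.
    """
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        # Intersect the running set with the transcripts for this k-mer.
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    # A read shorter than k yields no k-mers, so nothing is compatible.
    return compatible or set()


# Toy reference: two transcripts sharing a prefix (hypothetical data).
transcripts = {
    "tx1": "ACGTACGTTAGC",
    "tx2": "ACGTACGTCCGA",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("ACGTACGT", index, k=5)  # read from the shared prefix
unique = pseudoalign("CGTTAGC", index, k=5)   # read from the tx1-specific tail
```

A read drawn from the shared prefix stays ambiguous between both transcripts, while a read from the distinct tail resolves to one. A quantification step would then work with these compatibility sets rather than with base-level alignments.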

TACO produces robust multi-sample transcriptome assemblies from RNA-seq

TACO employs novel change-point detection to demarcate transcript start and end sites, leading to improved reconstruction accuracy compared with other tools in its class.

Fleximer: Accurate Quantification of RNA-Seq via Variable-Length k-mers

Fleximer, a novel method, is proposed to efficiently discover and select an optimal set of k-mers with flexible lengths, and is shown to cover a similar number of reads as Sailfish and Kallisto.

RNA-seq transcript quantification from reduced-representation data in recount2

A linear model is presented taking as input summary coverage of junctions and subdivided exons to output estimated abundances and associated uncertainty, and a procedure to construct confidence intervals for estimates is provided.

Partitioning RNAs by length improves transcriptome reconstruction from short-read RNA-seq data

Ladder-seq is introduced, an approach that separates transcripts by length before sequencing and uses this additional information to improve transcript quantification and assembly; it reveals 40% more genes harboring isoform switches than conventional RNA sequencing.

MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

Multi-Graph count (MGcount), a total-RNA-seq quantification tool combining two strategies for handling ambiguous alignments, is presented, which successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features.

Interoperable RNA-Seq analysis in the cloud.

Limitations of alignment-free tools in total RNA-seq quantification

A potential pitfall is identified in analyzing and quantifying lowly expressed genes and small RNAs with alignment-free pipelines, especially when these small RNAs contain biological variations.

Polee: RNA-Seq analysis using approximate likelihood

This work proposes a new method of approximating the likelihood function of a sparse mixture model, using a technique the authors call the Pólya tree transformation, and demonstrates that substituting this approximation for the real thing achieves most of the benefits with a fraction of the computational costs, leading to more accurate detection of differential transcript expression.

RNA-Seq Experiment and Data Analysis.

The protocols described in this chapter can be applied to the analysis of differential gene expression in control versus 17β-estradiol treatment of in vivo or in vitro systems.

DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition

Applying DE-kupl to human RNA-seq data sets identified multiple types of novel events, reproducibly across independent RNA-seq experiments, including differential long non-coding RNAs, splice and polyadenylation variants and exogenous RNA.



Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms

Sailfish, a computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data, exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads.

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

It is shown that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads, and estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.

Improving RNA-Seq expression estimates by correcting for fragment bias

Improvements in expression estimates, as measured by correlation with independently performed qRT-PCR, are found, and correction of bias leads to improved replicability of results across libraries and sequencing technologies.

EMSAR: estimation of transcript abundance from RNA-seq data by mappability-based segmentation and reclustering

EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts.

Estimation of alternative splicing isoform frequencies from RNA-Seq data

A novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available.
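
The general expectation-maximization scheme for splitting ambiguous reads among isoforms can be sketched as follows. This is a generic textbook-style EM, not IsoEM itself. It assumes reads have already been grouped by the set of transcripts they are compatible with, and it ignores transcript length, insert sizes, base quality scores and strand information. The function name `em_abundances` and the toy counts are hypothetical.

```python
def em_abundances(eq_classes, transcripts, n_iter=200):
    """Estimate relative transcript abundances with a basic EM loop.

    eq_classes maps a frozenset of transcript names to the number of
    reads compatible with exactly that set (an assumed input format).
    """
    # Start from a uniform guess over all transcripts.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each class's reads in proportion to current abundances.
        for members, count in eq_classes.items():
            total = sum(theta[t] for t in members)
            if total == 0.0:
                continue
            for t in members:
                expected[t] += count * theta[t] / total
        # M-step: renormalize the expected read counts into proportions.
        assigned = sum(expected.values())
        theta = {t: expected[t] / assigned for t in transcripts}
    return theta


# Toy data: 30 reads unique to tx1, 10 unique to tx2, 60 shared.
eq_classes = {
    frozenset({"tx1"}): 30,
    frozenset({"tx2"}): 10,
    frozenset({"tx1", "tx2"}): 60,
}
theta = em_abundances(eq_classes, ["tx1", "tx2"])
```

On this toy input the uniquely assigned reads set the ratio, so the shared reads end up divided three to one and the estimates approach 0.75 for tx1 and 0.25 for tx2.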

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

The results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.

Streaming fragment assignment for real-time analysis of sequencing experiments

eXpress is a software package for efficient probabilistic assignment of ambiguously mapping sequenced fragments that can determine abundances of sequenced molecules in real time and can be applied to ChIP-seq, metagenomics and other large-scale sequencing data.

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

TopHat2 is described, which incorporates many significant enhancements to TopHat, and combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes.

Mapping and quantifying mammalian transcriptomes by RNA-Seq

Although >90% of uniquely mapped reads fell within known exons, the remaining data suggest new and revised gene models, including changed or additional promoters, exons and 3′ untranscribed regions, as well as new candidate microRNA precursors.

RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays.

It is found that the Illumina sequencing data are highly replicable, with relatively little technical variation, and thus, for many purposes, it may suffice to sequence each mRNA sample only once (i.e., using one lane).