Falco: a quick and flexible single-cell RNA-seq processing framework on the cloud

@article{Yang2017FalcoAQ,
  title={Falco: a quick and flexible single-cell RNA-seq processing framework on the cloud},
  author={Andrian Yang and Michael Troup and Peijie Lin and Joshua Wing Kei Ho},
  journal={Bioinformatics},
  year={2017},
  volume={33},
  number={5},
  pages={767--769}
}
Summary: Single-cell RNA-seq (scRNA-seq) is increasingly used in a range of biomedical studies. Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability. Here we introduce Falco, a cloud-based framework to enable parallelization of existing RNA-seq processing pipelines using the big data technologies Apache Hadoop and Apache Spark for performing massively parallel analysis of large-scale transcriptomic data…

Cloud based computing technologies for genomic medicine
TLDR
Several new scalable bioinformatics methods that I have developed for the analysis of scRNA-seq data are reported, including Falco, a new cloud-based framework for processing large-scale scRNA-seq data, and Scavenger, a new pipeline to recover false negative, non-aligned reads in RNA-sequencing data.
Cloud accelerated alignment and assembly of full-length single-cell RNA-seq data using Falco
TLDR
The Falco framework can harness the parallel and distributed computing environment in modern cloud platforms to accelerate read alignment and transcript assembly of full-length bulk RNA-seq and scRNA-seq data.
Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq
TLDR
Cumulus is a cloud-based framework for analyzing large-scale single-cell and single-nucleus RNA sequencing data that substantially improves efficiency over conventional frameworks, while maintaining or improving the quality of results, enabling large-scale studies.
Bioinformatics applications on Apache Spark
TLDR
Apache Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery are surveyed to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
Scalability and Validation of Big Data Bioinformatics Software
Recent Applications of RNA Sequencing in Food and Agriculture
TLDR
This chapter introduces RNA-Seq and surveys its recent food and agriculture applications, ranging from differential gene expression, variant calling and detection, allele-specific expression, alternative splicing, alternative polyadenylation site usage, microRNA profiling, circular RNAs, single-cell RNA-Seq, and metatranscriptomics to systems biology.
BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data
TLDR
A new Hadoop-based software program, termed BigFiRSt, is presented to address the problem of merging short read pairs and mining SSRs from large-scale data using big data technology.
Cloud Computing Enabled Big Multi-Omics Data Analytics
TLDR
The adoption of advanced cloud-based and big data technologies for processing and analyzing omics data and insights into state-of-the-art cloud bioinformatics applications are provided.
Cloud-Based Bioinformatics Tools
  • B. Calabrese
  • Computer Science
    Encyclopedia of Bioinformatics and Computational Biology
  • 2019

References

SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision
TLDR
Tests of SparkSeq prove its scalability and overall fast performance by running the analyses of sequencing datasets, and prove that the use of cache and HDFS block size can be tuned for the optimal performance on multiple worker nodes.
HTSeq—a Python framework to work with high-throughput sequencing data
TLDR
This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and presents htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.
STAR: ultrafast universal RNA-seq aligner
TLDR
The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.
Halvade: scalable sequence analysis with MapReduce
TLDR
Halvade is a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner, and attains a significant speedup compared with running the individual tools with multithreading.
SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data
TLDR
This work introduces SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA).
StringTie enables improved reconstruction of a transcriptome from RNA-seq reads
TLDR
StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts produces more complete and accurate reconstructions of genes and better estimates of expression levels.
featureCounts: an efficient general purpose program for assigning sequence reads to genomic features
MOTIVATION Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
TLDR
TopHat2 is described, which incorporates many significant enhancements to TopHat, and combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes.
Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma
TLDR
The genome sequence of single cells isolated from brain glioblastomas was examined, which revealed shared chromosomal changes but also extensive transcription variation, including genes related to signaling, which represent potential therapeutic targets.