Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce

@article{Decap2017HalvadeRNAPV,
  title={Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce},
  author={Dries Decap and Joke Reumers and Charlotte Herzeel and Pascal Costanza and Jan Fostier},
  journal={PLoS ONE},
  year={2017},
  volume={12}
}
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel… 
HSRA: Hadoop-based spliced read aligner for RNA sequencing data
TLDR
HSRA is a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data to distributed memory systems such as multi-core clusters or cloud platforms.
SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark
TLDR
SparkRA, an Apache Spark-based pipeline that efficiently scales up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster, is presented.
Halvade somatic: Somatic variant calling with Apache Spark
TLDR
Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance.
Cloud accelerated alignment and assembly of full-length single-cell RNA-seq data using Falco
TLDR
The Falco framework can harness the parallel and distributed computing environment in modern cloud platforms to accelerate read alignment and transcript assembly of full-length bulk RNA-seq and scRNA-seq data.
Cloud based computing technologies for genomic medicine
TLDR
Several new scalable bioinformatics methods that I have developed for the analysis of scRNA-seq data are reported, including Falco, a new cloud-based framework for processing large-scale scRNA-seq data, and Scavenger, a new pipeline to recover false negative, non-aligned reads in RNA-sequencing data.
A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce
TLDR
A fast and scalable workflow is proposed that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences; it is also intended to graphically show the mined SNPs for user-friendly interaction and to analyze and optimize memory requirements.
RNA-Seq Data Analysis, Applications and Challenges
TLDR
The wide range of applications that RNA-Seq offers to the research community, as well as the shortcomings of the technology and the statistical challenges of the analysis, are discussed.
Multithreaded variant calling in elPrep 5
We present elPrep 5, which updates the elPrep framework for processing sequencing alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best
SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets
TLDR
SeQual takes full advantage of Big Data technologies to process massive datasets on distributed-memory systems such as clusters by relying on the open-source Apache Spark cluster computing framework.

References

Halvade: scalable sequence analysis with MapReduce
TLDR
Halvade is a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner, and attains a significant speedup compared with running the individual tools with multithreading.
Systematic evaluation of spliced alignment programs for RNA-seq data
TLDR
A comparison of 26 mapping protocols based on 11 programs and pipelines found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction.
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.
TLDR
The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Reliable identification of genomic variants from RNA-seq data.
TLDR
It is demonstrated that SNPiR outperforms current state-of-the-art approaches for variant detection from RNA-seq data and offers a cost-effective and reliable alternative for SNP discovery.
STAR: ultrafast universal RNA-seq aligner
TLDR
The Spliced Transcripts Alignment to a Reference (STAR) software, based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure, outperforms other aligners by a factor of >50 in mapping speed.
CloudBurst: highly sensitive read mapping with MapReduce
  • M. Schatz
  • Computer Science, Medicine
    Bioinform.
  • 2009
TLDR
CloudBurst is a new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics.
HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy
TLDR
Two software tools are developed to address the DNA MSA problem; the first employs trie trees to accelerate the centre star MSA strategy, reducing the expected time complexity from quadratic to linear.
Supercomputing for the parallelization of whole genome analysis
TLDR
The MegaSeq workflow is designed to harness the size and memory of the Cray XE6, housed at Argonne National Laboratory, for whole genome analysis in a platform designed to better match current and emerging sequencing volume.
featureCounts: an efficient general purpose program for assigning sequence reads to genomic features
MOTIVATION Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for
Scalable Genome Resequencing with ADAM and avocado
By Frank Austin Nothaft. Master of Science in Computer Science, University of California, Berkeley. Professor David Patterson, Chair. The decreased cost