Comparison of high-throughput sequencing data compression tools

Ibrahim Numanagić, James K. Bonfield, Faraz Hach, Jan Voges, Jörn Ostermann, Claudio Alberti, Marco Mattavelli, Süleyman Cenk Sahinalp. Nature Methods.
High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework. 
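FASTQ's four-line record layout (header, bases, separator, quality string) is what most of the compressors below exploit, splitting the streams and coding each separately. A minimal parsing sketch, assuming well-formed four-line records with no wrapped sequence lines:

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # '+' separator line, ignored
        qual = next(it).strip()
        yield header.strip().lstrip('@'), seq, qual

records = list(parse_fastq([
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "GGGTTT",   "+", "FFFFFF",
]))
```

Separating identifiers, bases, and qualities like this is the usual first step before applying stream-specific models.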
Performance evaluation of lossy quality compression algorithms for RNA-seq data
It is shown that lossy quality-value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate data storage and transmission burdens; the impact on results depends on the compression algorithm used and, where the algorithm supports multiple compression levels, on the chosen level.
ParRefCom: Parallel Reference-based Compression of Paired-end Genomics Read Datasets
This paper presents ParRefCom, a parallel reference-based algorithm for compressing HTS genomics short-read datasets that treats paired-end reads as first-class citizens and is able to significantly improve compression efficiency over the state-of-the-art.
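As a rough illustration of the reference-based idea (not ParRefCom's actual algorithm, which additionally exploits paired-end structure), an aligned read can be stored as a position plus only its mismatching bases — a sketch handling substitutions only, no indels:

```python
def encode_read(read, ref, pos):
    """Encode a read as (pos, [(offset, base), ...]) relative to ref."""
    diffs = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
    return pos, diffs

def decode_read(ref, pos, diffs, length):
    """Rebuild the read from the reference plus recorded differences."""
    bases = list(ref[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return ''.join(bases)

ref = "ACGTACGTACGT"
pos, diffs = encode_read("ACGAAC", ref, 0)
restored = decode_read(ref, pos, diffs, 6)
```

A high-identity read thus shrinks to an integer position and a short edit list, which is why reference-based tools do well on resequencing data.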
Dynamic Alignment-Free and Reference-Free Read Compression
DARRC addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph; it can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code.
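DARRC's guided de Bruijn graph is more elaborate, but the underlying structure can be sketched as a plain de Bruijn graph in which consecutive k-mers are linked through their shared (k-1)-overlap:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: edge from each k-mer's (k-1)-prefix
    to its (k-1)-suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

g = de_bruijn(["ACGTAC"], 3)
```

Reads that traverse the same graph paths can then be stored as path references rather than raw sequence, which is the source of the compression gain.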
Efficient high throughput sequencing data compression and genotyping methods for clinical environments
This thesis introduces the first computational tool which is able to accurately infer a CYP2D6 genotype from HTS data by formulating such problem as an instance of integer linear programming.
CALQ: compression of quality values of aligned sequencing data
This work presents a novel lossy compression scheme named CALQ, which performs as well as or better than state-of-the-art lossy compressors in terms of variant-calling recall and precision for most of the analyzed datasets.
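CALQ's quantization is driven by genotyping uncertainty at each locus; a much simpler illustration of lossy quality compression is uniform binning of Phred scores (the bin boundaries here are arbitrary for illustration, not CALQ's model):

```python
def bin_quality(qual, bins=(0, 10, 20, 30)):
    """Map each Phred+33 quality character to the floor of its bin."""
    out = []
    for ch in qual:
        q = ord(ch) - 33                       # decode Phred+33
        floor = max(b for b in bins if b <= q)  # snap down to bin floor
        out.append(chr(floor + 33))
    return ''.join(out)

binned = bin_quality("!5?I")   # Phred scores 0, 20, 30, 40
```

Reducing the quality alphabet from ~40 symbols to a handful makes the stream far more compressible by any downstream entropy coder, at the cost of precision.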
LW-FQZip 2: a parallelized reference-based compression of FASTQ files
This competence enables LW-FQZip 2 to obtain promising compression ratios at reasonable time and memory costs, making it a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data.
Transcriptomics and RNA-Seq Data Analysis
This chapter provides a conceptual framework for analyzing HTS data and offers numerical illustrations of solutions to both problems mentioned above, and includes examples from real data on how to compare performance of different software packages.
ARSDA: A new approach for storing, transmitting and analyzing high-throughput sequencing data
  X. Xia, 2017
The proper allocation of reads that map equally well to paralogous genes is illustrated, along with a new method for such allocation, to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth but also dramatically reduces the time needed for downstream data analysis.
An Empirical Study on Efficient Storage of Human Genome Data
This initial empirical study focuses on existing generic and domain-specific compression techniques for reducing the storage space of genome sequence data and compares erasure coding and replication in providing reliability on commodity hardware.
Optimal compressed representation of high throughput sequence data via light assembly
A new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie, which significantly improves the compression performance of alternatives without compromising speed.
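The trie representation can be sketched with nested dicts; the paper builds a compact trie after light assembly, but even this plain version shows how reads with shared prefixes collapse onto shared nodes:

```python
def trie_insert(root, read):
    """Insert a read into a nested-dict trie; shared prefixes share nodes."""
    node = root
    for base in read:
        node = node.setdefault(base, {})
    node['$'] = True              # terminal marker for a complete read
    return root

trie = {}
for r in ["ACGT", "ACGA", "ACGT"]:
    trie_insert(trie, r)
# the common "ACG" prefix is stored once; the trie branches only at the last base
```

A compact trie goes one step further, collapsing non-branching chains of nodes into single edges so that long shared substrings cost a single entry.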
Predictive Coding of Aligned Next-Generation Sequencing Data
The proposed algorithm combines alignment information to implicitly assemble local parts of the donor genome in order to compress the sequence reads, and yields compression results on par or better than the state-of-the-art.
Data compression for sequencing data
This review answers the question "why compression" in a quantitative manner, and gives other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.
Disk-based compression of data from genome sequencing
This paper proposes overlapping-reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only), which uses the conceptually simple and easily parallelizable idea of minimizers to fit a 134.0-Gbp dataset into only 5.31 GB of space.
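The minimizer idea can be sketched directly: a read's minimizer is its lexicographically smallest k-mer, and reads sharing a minimizer tend to overlap, so bucketing by minimizer clusters similar reads for joint compression (a simplification of the paper's disk-based scheme):

```python
def minimizer(seq, k):
    """Return the lexicographically smallest k-mer in seq."""
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))

# Group reads by their minimizer; overlapping reads usually land
# in the same bucket, making each bucket highly compressible.
reads = ["ACGTACGT", "CGTACGTT", "TTTTGGGG"]
buckets = {}
for r in reads:
    buckets.setdefault(minimizer(r, 4), []).append(r)
```

Here the first two reads overlap and share the minimizer "ACGT", so they fall into one bucket, while the unrelated third read falls into another.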
Compression of next-generation sequencing reads aided by highly efficient de novo assembly
Quip is presented, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats and the first assembly-based compressor, using a novel de novo assembly algorithm.
Aligned genomic data compression via improved modeling
The results indicate that the pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) does not apply for high coverage aligned data.
SCALCE: boosting sequence compression algorithms using locally consistent encoding
SCALCE is presented: a 'boosting' scheme based on the Locally Consistent Parsing technique that reorganizes reads so as to achieve a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.
Reference-based compression of short-read sequences using path encoding
An approach to compression that reduces the difficulty of managing large-scale sequencing data is presented; it is able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, on average more than 34% smaller than competing approaches.
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph
A novel reference-free method is presented for compressing data issued from high-throughput sequencing technologies, based on a reference probabilistic de Bruijn graph, which makes it possible to obtain higher compression rates without losing information pertinent to downstream analyses.
The Sequence Alignment/Map format and SAMtools
Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms.
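Each SAM alignment line carries 11 mandatory tab-separated fields, which can be parsed in a few lines (this sketch ignores header lines beginning with '@' and the optional tag fields after column 11):

```python
def parse_sam_line(line):
    """Split the 11 mandatory tab-separated SAM fields into a dict."""
    names = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
             "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
    fields = line.rstrip('\n').split('\t')
    rec = dict(zip(names, fields[:11]))
    for f in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        rec[f] = int(rec[f])       # numeric columns per the SAM spec
    return rec

rec = parse_sam_line("r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII")
```

The column separation is exactly what SAM/BAM compressors exploit: POS deltas, CIGAR strings, SEQ, and QUAL each get their own tailored model.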
Compression of FASTQ and SAM Format Sequencing Data
Several compression entries from the SequenceSqueeze contest are presented, including the winning entry; the tools are shown to define the new Pareto frontier for FASTQ compression, offering state-of-the-art ratios at affordable CPU costs.