High-order statistical compressor for long-term storage of DNA sequencing data

@article{Chlopkowski2016HighorderSC,
  title={High-order statistical compressor for long-term storage of DNA sequencing data},
  author={Marek Chlopkowski and Maciej Antczak and Michal Slusarczyk and Aleksander Wdowinski and Michal Zajaczkowski and Marta Kasprzak},
  journal={RAIRO Oper. Res.},
  year={2016},
  volume={50},
  pages={351--361}
}
We present a specialized compressor designed for efficient storage of FASTQ files produced by high-throughput DNA sequencers. Since the method has been optimized for compression quality, it is especially suitable for long-term storage and for genome research centers processing huge amounts of data (counted in petabytes). The proposed compressor uses high-order statistical models for range encoding, similar to Markov models, but the whole input is considered in building a symbol context… 
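The abstract's core idea, an order-k (Markov-like) context model driving a range encoder, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the fixed context order k, and the Laplace smoothing are assumptions for illustration; it estimates the ideal code length such a model would let a range coder approach, rather than emitting an actual bit stream.

```python
import math
from collections import defaultdict

class OrderKModel:
    """Adaptive order-k context model over the DNA alphabet.

    For each context of up to k preceding symbols, keep frequency
    counts of the next symbol; a range/arithmetic coder would consume
    the resulting probabilities. Counts start at 1 (Laplace smoothing)
    so unseen symbols never get probability zero.
    """

    ALPHABET = "ACGT"

    def __init__(self, k=3):
        self.k = k
        self.counts = defaultdict(lambda: {s: 1 for s in self.ALPHABET})

    def probability(self, context, symbol):
        # Condition only on the last k symbols of the history.
        table = self.counts[context[-self.k:]]
        return table[symbol] / sum(table.values())

    def update(self, context, symbol):
        self.counts[context[-self.k:]][symbol] += 1

def estimated_bits(model, sequence):
    """Sum of -log2 p(symbol | context): the code length (in bits) an
    ideal range coder driven by this adaptive model would approach."""
    bits = 0.0
    for i, symbol in enumerate(sequence):
        context = sequence[max(0, i - model.k):i]
        bits += -math.log2(model.probability(context, symbol))
        model.update(context, symbol)  # adapt after coding each symbol
    return bits
```

On repetitive input the model quickly learns the contexts and the estimated cost drops well below the 2 bits/symbol of a naive fixed encoding, which is the effect high-order context modeling exploits in genomic data.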
Efficient Storage of Genomic Sequences in High Performance Computing Systems
TLDR
A novel compression workflow that aims at improving the usability of referential compressors for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy is proposed.
Tackling the Challenges of FASTQ Referential Compression
TLDR
A novel approach for referential compression of FASTQ files based on the combination of local alignments with binary encoding optimized for long reads is introduced, named UdeACompress.
Recent Advances in Operations Research in Computational Biology, Bioinformatics and Medicine
TLDR
This special issue of RAIRO-OR includes nine papers selected from among forty presentations after two rounds of reviewing.
Study of biological networks using graph theory

References

SHOWING 1-10 OF 21 REFERENCES
Compression of DNA sequence reads in FASTQ format
TLDR
This work presents a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project.
Compressing Genomic Sequence Fragments Using SlimGene
TLDR
A set of domain-specific lossless compression schemes that achieve over 40× compression of fragments, outperforming bzip2 by over 6×, is introduced, and the study of using "lossy" quality values is initiated.
Preprocessing and storing high-throughput sequencing data
TLDR
A pipeline for preprocessing the initial set of short sequences, which removes low-quality reads and duplicated reads, is developed, and a method for preliminary joining of overlapping sequences is proposed, decreasing the cardinality of the initial sets to 13.9% and 27.8%.
SCALCE: boosting sequence compression algorithms using locally consistent encoding
TLDR
SCALCE, a 'boosting' scheme based on the Locally Consistent Parsing technique, is presented; it reorganizes the reads in a way that yields higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome.
DSRC 2 - Industry-oriented compression of FASTQ files
TLDR
This work proposes DSRC 2, a compression package that offers compression ratios comparable with the best existing solutions, while being a few times faster and more flexible.
Compression of next-generation sequencing reads aided by highly efficient de novo assembly
TLDR
Quip is presented, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats and the first assembly-based compressor, using a novel de novo assembly algorithm.
Data compression for sequencing data
TLDR
This review answers the question "why compression" in a quantitative manner, and gives other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.
A general purpose lossless data compression method for GPU
A Technique for High-Performance Data Compression
TLDR
A new compression algorithm is introduced that is based on principles not found in existing commercial methods: it dynamically adapts to the redundancy characteristics of the data being compressed. It also serves to illustrate system problems inherent in using any compression scheme.
Data Compression: The Complete Reference
TLDR
Detailed descriptions and explanations of the most well-known and frequently used compression methods are covered in a self-contained fashion, with an accessible style and technical level for specialists and nonspecialists.