Efficient Compression of Genomic Sequences

Diogo Pratas, Armando J. Pinho, Paulo Jorge S. G. Ferreira
2016 Data Compression Conference (DCC)
The number of genomic sequences is growing substantially. Besides discarding part of the data, the only efficient possibility for coping with this trend is data compression. We present an efficient compressor for genomic sequences, allowing both reference-free and referential compression. This compressor uses a mixture of context models of several orders, according to two model classes: reference and target. A new type of context model, which is capable of tolerating substitution errors, is… 
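The idea of mixing context models of several orders can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the additive smoothing parameter `alpha`, the chosen orders, and the weight-update rule with forgetting factor `gamma` are all assumptions made for illustration. The sketch covers only a reference-free mixture and omits the substitution-tolerant model and the reference/target model classes. Instead of producing a compressed stream, it estimates the number of bits an ideal arithmetic coder would need.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k finite-context model with additive (alpha) smoothing (illustrative)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha
        # One count vector per context string, created lazily.
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is special-cased.
        return history[-self.order:] if self.order > 0 else ""

    def probs(self, history):
        c = self.counts[self._context(history)]
        total = sum(c) + self.alpha * len(ALPHABET)
        return [(n + self.alpha) / total for n in c]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def estimate_bits(sequence, orders=(1, 4, 8), gamma=0.9):
    """Estimate the coding cost (in bits) of `sequence` under a model mixture.

    Assumed weight update: w_i <- w_i**gamma * p_i(symbol), then normalise,
    so models that predicted recent symbols well gain weight.
    """
    models = [ContextModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    max_order = max(orders)
    bits = 0.0
    for i, sym in enumerate(sequence):
        history = sequence[max(0, i - max_order):i]
        s = ALPHABET.index(sym)
        dists = [m.probs(history) for m in models]
        # Mixture probability of the symbol that actually occurred.
        p = sum(w * d[s] for w, d in zip(weights, dists))
        bits += -math.log2(p)
        # Re-weight the models by how well each predicted this symbol.
        weights = [(w ** gamma) * d[s] for w, d in zip(weights, dists)]
        total = sum(weights)
        weights = [w / total for w in weights]
        for m in models:
            m.update(history, sym)
    return bits
```

Calling `estimate_bits(seq) / len(seq)` gives an estimated bits-per-base figure; for a highly repetitive input it should fall well below the 2 bits per base of a naive fixed-length encoding.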


AC: A Compression Tool for Amino Acid Sequences
AC, a state-of-the-art method for lossless compression of amino acid sequences, works based on the cooperation between finite-context models and substitutional tolerant Markov models and provides the best bit-rates.
Improve the compression of bacterial DNA sequence
A new lossless, reference-free compression method to increase the compression performance of bacterial DNA sequences using Bzip2, showing a decrease in compression ratio of 7.74% on average, with compression speed about ten times faster than before.
A Survey on Data Compression Methods for Biological Sequences
A comprehensive survey of existing compression approaches that are specialized for biological data, including protein and DNA sequences, and a comparison of the performance of several methods in terms of compression ratio, memory usage, and compression/decompression time.
Substitutional Tolerant Markov Models for Relative Compression of DNA Sequences
A new model is proposed, the substitutional tolerant Markov model (STMM), which can be used in cooperation with regular Markov models to improve compression efficiency, and which shows high efficiency in modeling species that split less than 40 million years ago.
Efficient DNA sequence compression with neural networks
GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models, is created and benchmarked as a reference-free DNA compressor in 5 datasets.
AliCo: A New Efficient Representation for SAM Files
This work presents AliCo, a new compression method tailored to the aligned data represented in the SAM format that outperforms in compression ratio, on average, the state-of-the-art compressors for SAM files, achieving more than 85% reduction in size when operating in its lossless mode.
GeCo2: An Optimized Tool for Lossless Compression and Analysis of DNA Sequences
The GeCo2 tool is described, an improved version of the GeCo tool, that permits more flexibility for compression and analysis purposes, namely a higher ability of addressing different characteristics of the DNA sequences.
A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models
A new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms is described, which attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources.
Sequence Compression Benchmark (SCB) database — a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences
It is found that modern compressors offer a large improvement in compactness and speed compared to gzip, and the Sequence Compression Benchmark database makes it possible to select the optimal compressor based on the data type and usage scenario specific to a particular application.
HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data
HRCM is a lossless compression method able to compress a single sequence as well as large collections of sequences. Experimental verification shows that HRCM is superior to the best-known methods in genome batch compression.


DELIMINATE - a fast and efficient method for loss-less compression of genomic sequences: Sequence analysis
A novel compression algorithm (DELIMinATE) that can rapidly compress genomic sequence data in a loss-less fashion is presented and validation results indicate relatively higher compression efficiency of DELIMINATE when compared with popular general purpose compression algorithms, namely, gzip, bzip2 and lzma.
Disk-based compression of data from genome sequencing
This paper proposes overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only), which makes use of a conceptually simple and easily parallelizable idea of minimizers to fit the 134.0 Gbp dataset into only 5.31 GB of space.
BIND – An algorithm for loss-less compression of nucleotide sequence data
Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these
FRESCO: Referential Compression of Highly Similar Sequences
S. Wandelt, U. Leser. Computer Science. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2013.
A general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO), and a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression).
GDC 2: Compression of large collections of genomes
This paper proposes an algorithm that is able to compress the collection of 1092 human diploid genomes about 9,500 times, which is about 4 times better than what is offered by the other existing compressors.
MFCompress: a compression tool for FASTA and multi-FASTA data
MFCompress is described, specially designed for the compression of FASTA and multi-FASTA files, which can provide additional average compression gains of almost 50%, and potentially doubles the available storage, although at the cost of some more computation time.
Compression of DNA sequence reads in FASTQ format
This work presents a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project.
DNA-COMPACT: DNA COMpression Based on a Pattern-Aware Contextual Modeling Technique
A two-pass lossless genome compression algorithm, which highlights the synthesis of complementary contextual models, to improve the compression performance, and demonstrated performance advantages over best existing algorithms.
iDoComp: a compression scheme for assembled genomes
iDoComp is a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both compression and decompression. It outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time.
Compression of FASTQ and SAM Format Sequencing Data
Several compression entries from the SequenceSqueeze contest are presented, including the winning entry, and the tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs.