Efficient Compression of Genomic Sequences

@inproceedings{Pratas2016EfficientCO,
  title={Efficient Compression of Genomic Sequences},
  author={Diogo Pratas and Armando J. Pinho and Paulo Jorge S. G. Ferreira},
  booktitle={2016 Data Compression Conference (DCC)},
  year={2016},
  pages={231--240}
}
The number of genomic sequences is growing substantially. Besides discarding part of the data, the only efficient possibility for coping with this trend is data compression. We present an efficient compressor for genomic sequences, allowing both reference-free and referential compression. This compressor uses a mixture of context models of several orders, according to two model classes: reference and target. A new type of context model, which is capable of tolerating substitution errors, is…
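
The mixture of context models mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class `ContextModel`, the function `estimate_bits`, the smoothing parameter `alpha`, the forgetting factor `gamma`, and the weight update rule are all assumptions chosen for illustration. The sketch only estimates the code length in bits (no arithmetic coder), it omits the substitution-tolerant model, and it approximates the reference/target split by priming one set of models on the reference before coding the target.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextModel:
    """Order-k finite-context model with additive smoothing (hypothetical helper)."""

    def __init__(self, order, alpha=1.0):
        self.order = order
        self.alpha = alpha  # assumed smoothing parameter
        self.counts = defaultdict(lambda: [0] * len(ALPHABET))

    def _context(self, history):
        # history[-0:] would return the whole string, so order 0 is handled apart
        return history[-self.order:] if self.order > 0 else ""

    def probs(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts) + self.alpha * len(ALPHABET)
        return [(c + self.alpha) / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1


def estimate_bits(target, reference="", orders=(1, 4, 8), gamma=0.9):
    """Estimate the code length of `target` in bits under a weighted mixture."""
    models = [ContextModel(k) for k in orders]
    max_order = max(orders)

    # Simplified "reference" class: prime the counts on the reference sequence.
    for i, symbol in enumerate(reference):
        history = reference[max(0, i - max_order):i]
        for m in models:
            m.update(history, symbol)

    weights = [1.0 / len(models)] * len(models)
    bits = 0.0
    for i, symbol in enumerate(target):
        history = target[max(0, i - max_order):i]
        x = ALPHABET.index(symbol)
        p_each = [m.probs(history)[x] for m in models]
        # Mixture probability of the symbol that actually occurred.
        p_mix = sum(w * p for w, p in zip(weights, p_each))
        bits += -math.log2(p_mix)
        # Assumed update: reward models that predicted well, then renormalize.
        weights = [(w ** gamma) * p for w, p in zip(weights, p_each)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for m in models:
            m.update(history, symbol)
    return bits
```

Calling `estimate_bits(seq)` gives a reference-free estimate, while `estimate_bits(seq, reference=ref)` gives a referential one; for a target that closely resembles the reference, the second value would be expected to be smaller.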

AC: A Compression Tool for Amino Acid Sequences
TLDR
AC, a state-of-the-art method for lossless compression of amino acid sequences, works based on the cooperation between finite-context models and substitutional tolerant Markov models and provides the best bit-rates.
Improve the compression of bacterial DNA sequence
TLDR
A new, lossless and reference-free compression method to increase the compression performance of bacterial DNA sequences using Bzip2, showing a decrease in compression ratio of 7.74% on average, with compression speed about ten times faster than ever before.
FQSqueezer: k-mer-based compression of sequencing data
  • S. Deorowicz
  • Computer Science, Medicine
  • Scientific Reports
  • 2020
TLDR
FQSqueezer is presented, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world.
A Survey on Data Compression Methods for Biological Sequences
TLDR
A comprehensive survey of existing compression approaches that are specialized for biological data, including protein and DNA sequences, and a comparison of the performance of several methods in terms of compression ratio, memory usage and compression/decompression time.
Substitutional Tolerant Markov Models for Relative Compression of DNA Sequences
TLDR
A new model is proposed, the substitutional tolerant Markov model (STMM), which can be used in cooperation with regular Markov models to improve compression efficiency, and shows high efficiency in modeling species that have split less than 40 million years ago.
Efficient DNA sequence compression with neural networks
TLDR
GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models, is created and benchmarked as a reference-free DNA compressor in 5 datasets.
AliCo: A New Efficient Representation for SAM Files
TLDR
This work presents AliCo, a new compression method tailored to the aligned data represented in the SAM format that outperforms in compression ratio, on average, the state-of-the-art compressors for SAM files, achieving more than 85% reduction in size when operating in its lossless mode.
GeCo2: An Optimized Tool for Lossless Compression and Analysis of DNA Sequences
TLDR
The GeCo2 tool is described, an improved version of the GeCo tool, that permits more flexibility for compression and analysis purposes, namely a higher ability of addressing different characteristics of the DNA sequences.
A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models
TLDR
A new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms is described, which attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources.
Sequence Compression Benchmark (SCB) database — a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences
TLDR
The Sequence Compression Benchmark database allows comparing compressors and their settings using a variety of performance measures, offering the opportunity to select the optimal compressor based on the data type and usage scenario specific to a particular application.

References

SHOWING 1-10 OF 22 REFERENCES
GReEn: a tool for efficient compression of genome resequencing data
TLDR
GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence, overcomes some drawbacks of the recently proposed tool GRS, namely, the possibility of compressing sequences that cannot be handled by GRS, faster running times and compression gains of over 100-fold for some sequences.
DELIMINATE - a fast and efficient method for loss-less compression of genomic sequences: Sequence analysis
TLDR
A novel compression algorithm (DELIMINATE) that can rapidly compress genomic sequence data in a lossless fashion is presented, and validation results indicate relatively higher compression efficiency of DELIMINATE when compared with popular general-purpose compression algorithms, namely gzip, bzip2 and lzma.
Disk-based compression of data from genome sequencing
TLDR
This paper proposes overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only), which makes use of a conceptually simple and easily parallelizable idea of minimizers to fit the 134.0 Gbp dataset into only 5.31 GB of space.
BIND – An algorithm for loss-less compression of nucleotide sequence data
Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these…
FRESCO: Referential Compression of Highly Similar Sequences
  • S. Wandelt, U. Leser
  • Computer Science, Medicine
  • IEEE/ACM Transactions on Computational Biology and Bioinformatics
  • 2013
TLDR
A general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO), and a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression).
GDC 2: Compression of large collections of genomes
TLDR
This paper proposes an algorithm that is able to compress the collection of 1092 human diploid genomes about 9,500 times, which is about 4 times better than what is offered by the other existing compressors.
Optimized Relative Lempel-Ziv Compression of Genomes
TLDR
It is found that simple non-greedy parsings can significantly improve compression performance, and a strong correlation is discovered between the starting positions of long factors and their positions in a reference genome.
MFCompress: a compression tool for FASTA and multi-FASTA data
TLDR
MFCompress is described, specially designed for the compression of FASTA and multi-FASTA files, which can provide additional average compression gains of almost 50%, and potentially doubles the available storage, although at the cost of some more computation time.
A novel compression tool for efficient storage of genome resequencing data
TLDR
A novel compression tool for storing and analyzing genome resequencing data, named GRS, which is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information, and automatically rebuild the individual genome sequence data using the reference genome sequence.
Compression of DNA sequence reads in FASTQ format
TLDR
This work presents a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project.