De Novo NGS Data Compression

Gaëtan Benoit, Claire Lemaitre, Guillaume Rizk, Erwan Drezen, Dominique Lavenier
In: Algorithms for Next-Generation Sequencing Data
This chapter covers the compression of genomic data without a reference genome. It presents techniques developed specifically to compress sequencing data in lossless or lossy modes, and it evaluates several NGS data compression tools.


Compression of DNA sequence reads in FASTQ format
This work presents a specialized compression algorithm for genomic data in FASTQ format that outperforms its competitor, G-SQZ, on a number of datasets from the 1000 Genomes Project.
Transformations for the compression of FASTQ quality scores of next-generation sequencing data
Experiments show that both lossy and lossless transformations are useful, and that simple coding methods, which consume fewer computing resources, are highly competitive, especially when random access to reads is needed.
G-SQZ: compact encoding of genomic sequence and quality data
This work presents G-SQZ, a Huffman coding-based representation scheme specific to sequencing reads. It compresses data without altering the relative order of the reads and allows selective access without scanning and decoding from the start.
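To make the idea of Huffman coding on sequencing reads concrete, here is a minimal sketch in Python. It is not the G-SQZ implementation: treating each (base, quality) pair as one symbol is an assumption made for illustration, and the function names `huffman_code`, `encode_read`, and `decode_bits` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(symbols):
    """Build a prefix-free code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never compares the dict payloads
    heap = [(n, next(tie), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode_read(bases, quals, code):
    """Encode one read as a bit string, one code word per (base, quality) pair."""
    return "".join(code[pair] for pair in zip(bases, quals))


def decode_bits(bits, code):
    """Decode a bit string back into the list of (base, quality) pairs."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:  # prefix-free: the first match is the symbol
            out.append(inverse[current])
            current = ""
    return out


if __name__ == "__main__":
    bases, quals = "ACGTACGA", "IIIIII#I"
    code = huffman_code(zip(bases, quals))
    bits = encode_read(bases, quals, code)
    assert decode_bits(bits, code) == list(zip(bases, quals))
    print(len(bits), "bits instead of", 16 * len(bases), "for two ASCII bytes per pair")
```

Because every read is encoded independently against a shared code table, the order of reads is untouched, and a single read can be decoded on its own if its bit offset is known. Storing such offsets is left out of this sketch.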
Compression of high throughput sequencing data with probabilistic de Bruijn graph
The new method, Leon, aims to compress the DNA sequences of high-throughput sequencing data without a reference genome, using techniques derived from existing assembly principles that may better exploit the redundancy of NGS data.
Disk-based compression of data from genome sequencing
This paper proposes Overlapping Reads COmpression with Minimizers (ORCOM), a compression algorithm dedicated to sequencing reads (DNA only). It relies on the conceptually simple and easily parallelizable idea of minimizers, and it fits a 134.0 Gbp dataset into only 5.31 GB of space.
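The minimizer idea can be sketched in a few lines of Python. This is a simplified illustration, not the paper's algorithm: it takes the lexicographically smallest k-mer of each read as its minimizer and ignores reverse complements, and the names `minimizer`, `bucket_by_minimizer`, and `bits_per_base` are hypothetical. Reads that overlap tend to share a minimizer, so they land in the same bucket, and each bucket can then be compressed independently and in parallel.

```python
from collections import defaultdict


def minimizer(read, k):
    """Return the lexicographically smallest k-mer of `read`."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k):
    """Group reads by minimizer; overlapping reads tend to share a bucket."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


def bits_per_base(compressed_bytes, n_bases):
    """Average number of compressed bits spent per input base."""
    return 8 * compressed_bytes / n_bases


if __name__ == "__main__":
    reads = ["GATTACAGG", "TTACAGGCA", "CCGGTTAAC"]
    print(bucket_by_minimizer(reads, k=4))
    # {'ACAG': ['GATTACAGG', 'TTACAGGCA'], 'CCGG': ['CCGGTTAAC']}

    # Figures quoted in the summary above, assuming 1 GB = 10**9 bytes.
    print(round(bits_per_base(5.31e9, 134.0e9), 3))  # 0.317
```

Under that unit assumption, the quoted figures correspond to roughly 0.32 bits per base, against the 2 bits per base of a plain four-letter encoding.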
Efficient algorithms for the compression of FASTQ files
This paper proposes novel algorithms for compressing FASTQ files and shows that they are competitive with, and perform better than, the best-known algorithms for this problem.
Lossy compression of quality scores in genomic data
This work presents existing compression options for quality score data, and introduces two new lossy techniques that are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation.
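The trade-off between storage and fidelity can be illustrated with the simplest lossy transformation, uniform binning of Phred scores. The sketch below is a generic example and not one of the two techniques that the paper introduces; the function name `bin_quality`, the bin width of 8, and the Phred+33 ASCII offset are assumptions.

```python
def bin_quality(qual, bin_width=8, offset=33):
    """Lossy transform: round each Phred score down to the start of its bin.

    `qual` is a FASTQ quality string, assumed to be Phred+33 encoded.
    A wider bin leaves fewer distinct symbols (cheaper to store) but a
    larger worst-case error of bin_width - 1 per score.
    """
    binned = []
    for ch in qual:
        q = ord(ch) - offset                   # decode the Phred score
        q_binned = (q // bin_width) * bin_width
        binned.append(chr(q_binned + offset))  # re-encode as ASCII
    return "".join(binned)


if __name__ == "__main__":
    print(bin_quality("I#5?"))  # I!19  (scores 40, 2, 20, 30 -> 40, 0, 16, 24)
```

The binned string has a smaller alphabet and longer runs of identical symbols, which is what lets a downstream lossless coder store it in less space.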
Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification
This work presents Read-Quality-Sparsifier (RQS), a scalable compressive framework that substantially outperforms other de novo quality score compression methods in compression ratio and speed while maintaining SNP-calling accuracy.
Adaptive reference-free compression of sequence quality scores
By aggregating a set of reads into a compressed index, this work finds that the majority of bases can be predicted from the sequence of bases adjacent to them and are hence likely to be less informative for variant calling or other applications.
Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph
This work presents a novel reference-free method for compressing data from high-throughput sequencing technologies. It is based on a reference probabilistic de Bruijn graph, which yields higher compression rates without losing information pertinent to downstream analyses.
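A probabilistic de Bruijn graph can be held in a Bloom filter of k-mers, and a read can then be stored as a starting k-mer plus the choices made at branching nodes. The Python sketch below illustrates that principle only. It is not the published method: it inserts every k-mer of every read (no abundance filtering, no separate handling of sequencing errors), and the class and function names, filter size, and hash construction are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_graph(reads, k):
    """Insert every k-mer of every read; the filter is the (implicit) graph."""
    bf = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def successors(bf, kmer):
    """k-mers reachable from `kmer` by one base, according to the filter."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in bf]


def encode_path(read, bf, k):
    """Store the anchor k-mer, the read length, and bases at branching nodes only."""
    kmer, choices = read[:k], []
    for base in read[k:]:
        if len(successors(bf, kmer)) != 1:  # ambiguous step: record the base
            choices.append(base)
        kmer = kmer[1:] + base
    return read[:k], len(read), choices


def decode_path(anchor, length, choices, bf):
    """Rebuild the read by walking the graph, consuming a choice at each branch."""
    read, kmer, pending = anchor, anchor, iter(choices)
    while len(read) < length:
        succ = successors(bf, kmer)
        base = succ[0][-1] if len(succ) == 1 else next(pending)
        read += base
        kmer = kmer[1:] + base
    return read


if __name__ == "__main__":
    reads = ["ACGTACGGA", "CGTACGGAT"]
    bf = build_graph(reads, k=5)
    for r in reads:
        assert decode_path(*encode_path(r, bf, 5), bf) == r
```

Decoding stays lossless even when the filter returns false positives, because the encoder and the decoder query the same filter and therefore see the same branches. A false positive only costs one extra recorded base.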