BEETL-fastq: a searchable compressed archive for DNA reads

Authors: Lilian Janin, Ole Schulz-Trieglaff, Anthony J. Cox
Volume 30, issue 19
MOTIVATION: FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be…


Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes

The concept of a population BWT is introduced and used to store and index the sequencing reads of 2,705 samples from the 1000 Genomes Project and it is shown that as more genomes are added, identical read sequences are increasingly observed and compression becomes more efficient.

Compressing and Indexing Aligned Readsets

This paper builds a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments; it then computes the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and builds a compressed full-text index.

ARSDA: A new approach for storing, transmitting and analyzing high-throughput sequencing data

  • X. Xia
  • Computer Science
  • 2017
A new method is illustrated for properly allocating reads that map equally well to paralogous genes, and the approach is shown not only to save a huge amount of storage space and transmission bandwidth but also to dramatically reduce time in downstream data analysis.

Compression of short-read sequences using path encoding

This work presents an approach to biological sequence compression that reduces the difficulty of managing the data produced by large-scale transcriptome sequencing. It offers a new direction by sitting between pure reference-based and reference-free compression, combining much of the benefit of reference-based approaches with the flexibility of de novo encoding.

ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data

  • X. Xia
  • Computer Science
  • G3: Genes, Genomes, Genetics
  • 2017
ARSDA includes functions to take advantage of HTS data in the new sequence format for downstream data analysis such as gene expression characterization; gene expression results are contrasted between ARSDA and Cufflinks so readers can better appreciate the strength of ARSDA.

Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees

PAC, a novel approximate membership query data structure for querying collections of sequence datasets, is presented; it shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size and can query 500,000 transcript sequences in less than an hour.
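
As a rough illustration of what an approximate membership query means for sequence data, the sketch below indexes the k-mers of a sequence in a plain Bloom filter. This is not the PAC structure itself, which the summary above describes only in outline; the class name, the bit-array size, the number of hash functions and the SHA-256-based hashing are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter (illustrative; sizes and hashing are assumptions)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


index = BloomFilter()
for kmer in kmers("ACGTACGTGA", 4):
    index.add(kmer)

query_hits = [kmer in index for kmer in kmers("ACGTACG", 4)]
```

A query sequence is usually reported as present when a large enough fraction of its k-mers hit the filter; the threshold is a tuning choice and is not taken from the entry above.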

Reference-based compression of short-read sequences using path encoding

An approach to compression that reduces the difficulty of managing large-scale sequencing data is presented and is able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches.

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

A novel reference-free method for compressing data from high-throughput sequencing technologies is presented, based on a reference probabilistic de Bruijn graph; it obtains higher compression rates without losing information pertinent to downstream analyses.

CoGI: Towards Compressing Genomes as an Image

This paper proposes a novel approach called CoGI (the abbreviation of Compressing Genomes as an Image) for genome compression, which transforms the genomic sequences to a two-dimensional binary image (or bitmap), then applies a rectangular partition coding algorithm to compress the binary image.

Lossy Compression of Quality Values in Sequencing Data

This study addresses the compression of SAM files, the standard output files for DNA alignment, and introduces a new lossy model, dynamic binning, and compares its performance to other lossy techniques, namely Illumina binning, LEON and QVZ.
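
To make the idea of lossy quality binning concrete, here is a minimal sketch of fixed binning, the simplest of the techniques named above. It is not the dynamic binning model of the study. The bin boundaries and representative values follow the commonly cited 8-level Illumina scheme, but they are an assumption here and should be checked against the vendor documentation before use; the Phred+33 offset is likewise an assumption.

```python
import bisect

# Assumed 8-level scheme: scores below 2 are left unchanged,
# 2-9 -> 6, 10-19 -> 15, 20-24 -> 22, 25-29 -> 27,
# 30-34 -> 33, 35-39 -> 37, 40 and above -> 40.
BIN_STARTS = [2, 10, 20, 25, 30, 35, 40]
BIN_VALUES = [6, 15, 22, 27, 33, 37, 40]


def bin_quality(q):
    """Map one Phred score to the representative value of its bin."""
    idx = bisect.bisect_right(BIN_STARTS, q) - 1
    return q if idx < 0 else BIN_VALUES[idx]


def bin_quality_string(qual, offset=33):
    """Apply the binning to an ASCII-encoded quality string."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


binned = bin_quality_string("II5!")
```

Fewer distinct quality values means longer runs and a smaller alphabet, which is what makes the binned string easier to compress.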



GPU-Accelerated BWT Construction for Large Collection of Short Reads

CX1 is the first tool that can take advantage of the parallelism of a graphics processing unit (GPU, a relatively cheap device providing a thousand or more primitive cores) and, simultaneously, the parallelism of a multi-core CPU and, more interestingly, of a cluster of GPU-enabled nodes.

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

It is demonstrated that compression may be greatly improved by a particular reordering of the sequences in the collection and a novel 'implicit sorting' strategy is given that enables these benefits to be realized without the overhead of sorting the reads.
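
The effect of read order on the BWT of a collection can be seen on a toy example. The sketch below builds the BWT of a small set of reads by explicitly sorting all suffixes, then counts runs of identical symbols before and after putting the reads in reverse lexicographic order. Fewer runs is used here as a crude proxy for compressibility. The function names and the four reads are made up for illustration, and the explicit sort is exactly the overhead that the 'implicit sorting' strategy in the entry above is said to avoid.

```python
from itertools import groupby


def collection_bwt(reads):
    """BWT of a read collection, with end markers ordered by read position.

    Each read i is treated as read + '$_i', where every '$_i' sorts before
    all letters and '$_i' < '$_j' when i < j. All end markers are written
    as '$' in the output.
    """
    entries = []
    for i, read in enumerate(reads):
        for j in range(len(read) + 1):
            prev = read[j - 1] if j > 0 else "$"
            entries.append((read[j:], i, prev))
    # A suffix that is a prefix of another sorts first, which matches
    # '$' being smaller than any letter; ties are broken by read index.
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(prev for _, _, prev in entries)


def run_count(s):
    """Number of maximal runs of identical characters in s."""
    return sum(1 for _ in groupby(s))


reads = ["ACGT", "TTGA", "CCGT", "GGGA"]
# Reverse lexicographic order: sort reads by their reversed strings.
rlo_reads = sorted(reads, key=lambda r: r[::-1])

runs_original = run_count(collection_bwt(reads))
runs_rlo = run_count(collection_bwt(rlo_reads))
```

On this input the reordered collection yields fewer runs than the original order; the gain on four short reads is small and says nothing about the size of the effect on real data.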

Fast and accurate short read alignment with Burrows–Wheeler transform

Burrows-Wheeler Alignment tool (BWA), a new read alignment package based on backward search with the Burrows–Wheeler Transform (BWT), is implemented to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
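
Backward search, the primitive named in the entry above, can be sketched in a few lines for exact matching. The code below is a textbook illustration under simplifying assumptions, not BWA's implementation: the BWT is built by sorting rotations (only viable for tiny inputs), the occurrence table is stored in full, and mismatches and gaps are not handled. All function names are hypothetical.

```python
def bwt_from_text(text):
    """Build the BWT of text + '$' by sorting all rotations."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def build_fm_index(bwt):
    """Return (C, occ).

    C[c] is the number of symbols in bwt that are smaller than c;
    occ[c][i] is the number of occurrences of c in bwt[:i].
    """
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ


def count_occurrences(pattern, bwt, C, occ):
    """Count exact occurrences of pattern, consuming it right to left."""
    lo, hi = 0, len(bwt)  # half-open range of suffixes matching so far
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt = bwt_from_text("GATTACATTAC")
C, occ = build_fm_index(bwt)
hits = count_occurrences("TTA", bwt, C, occ)
```

Each pattern character costs a constant number of table lookups, so the query time depends on the pattern length rather than on the size of the indexed text.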

Adaptive reference-free compression of sequence quality scores

By aggregating a set of reads into a compressed index, it is found that the majority of bases can be predicted from the sequence of bases that are adjacent to them and, hence, are likely to be less informative for variant calling or other applications.

Efficient de novo assembly of large genomes using compressed data structures.

SGA (String Graph Assembler), a new assembler based on the overlap-based string graph model of assembly, is presented; it is simply parallelizable and provides the first practical assembler for a mammalian-sized genome on a low-end computing cluster.

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

The FASTQ format is defined, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them, based on publicly available information such as the MAQ documentation and conventions recently agreed by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava and EMBOSS.
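
A short sketch may help show what the format and its quality encodings look like in practice. It assumes the common four-line record layout with unwrapped sequence and quality lines, decodes Sanger-style qualities with an ASCII offset of 33, and converts a Solexa-scaled score to the Phred scale with Q_phred = 10 log10(10^(Q_solexa/10) + 1). The function names and the example record are made up for illustration.

```python
import math


def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def sanger_phred(qual):
    """Decode a Sanger-style quality string (ASCII offset 33)."""
    return [ord(c) - 33 for c in qual]


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


record = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
name, seq, qual = next(parse_fastq(record))
scores = sanger_phred(qual)
```

The two scales agree closely for high-quality bases and diverge for low ones, which is why mixing up the variants mainly distorts the poor-quality end of a read.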

Lightweight BWT Construction for Very Large String Collections

The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m, and apply to any string collection over any alphabet.

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds.

The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching

This book will serve as a reference for seasoned professionals or researchers in the area, while providing a gentle introduction, making it accessible for senior undergraduate students or first year graduate students embarking upon research in compression, pattern matching, full text retrieval, compressed index structures, or other areas related to the BWT.