On compressing and indexing repetitive sequences

@article{Kreft2013OnCA,
  title={On compressing and indexing repetitive sequences},
  author={Sebastian Kreft and Gonzalo Navarro},
  journal={Theor. Comput. Sci.},
  year={2013},
  volume={483},
  pages={115--133}
}

Compressed Computation for Text Indexing

This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index that offers both fast queries and strong compression rates (up to exponential compression can be achieved).

Universal Compressed Text Indexing

This paper develops the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can be built on top of any dictionary-compressed text representation, and shows that the relation between indexing and compression is much deeper than what was previously thought.

Block Tree based Universal Self-Index for Repetitive Text Collections

Being able to manipulate the text within compressed space, with compression related to its repetitiveness, is critically important in many areas of study, such as bioinformatics, information retrieval, and data mining.

Universal indexes for highly repetitive document collections

Indexing Highly Repetitive Collections via Grammar Compression

This proposal will focus on the main drawbacks of grammar-based compressors and self-indexes in repetitive collections.

Flexible Indexing of Repetitive Collections

Practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text are described, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors.
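
As a rough illustration of why the RLBWT suits repetitive text, the sketch below computes a Burrows-Wheeler transform and run-length encodes it. It is a minimal sketch, not the paper's data structure: the function names `bwt` and `run_length_encode` are hypothetical, the `$` sentinel is assumed to be absent from the text and smaller than every other character, and the transform sorts all rotations, which is quadratic and only suitable for small examples.

```python
from typing import List, Tuple


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform by sorting all rotations (illustration only)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> List[Tuple[str, int]]:
    """Collapse each maximal run of equal characters into (char, run_length)."""
    runs: List[Tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(run_length_encode(transformed))
```

On a repetitive input, equal characters tend to cluster in the transform, so the number of runs can be much smaller than the text length; that run count is the quantity an RLBWT-based index is proportional to.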

CHICO: A Compressed Hybrid Index for Repetitive Collections

This paper presents an implementation of a hybrid index that combines the effectiveness of Lempel-Ziv factorization with a modular design. It successfully indexes thousands of genomes on a commodity desktop and scales up to multi-terabyte collections, provided there is enough secondary memory.

Indexing Highly Repetitive Collections

Progress along three research lines that address the need to index and search huge, highly repetitive sequence collections is described: compressed suffix arrays, grammar-compressed indexes, and Lempel-Ziv compressed indexes.
...

References

SHOWING 1-10 OF 76 REFERENCES

Self-indexing Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes
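
For readers unfamiliar with the format, the following is a minimal sketch of one common formulation of the LZ77 parse, in which each phrase copies the longest match starting at an earlier position and then appends one explicit character. It is not the index from the paper: the names `lz77_factorize` and `lz77_decode` are hypothetical, and the brute-force search takes quadratic time or worse, so it is meant only for small examples.

```python
from typing import List, Tuple

Phrase = Tuple[int, int, str]  # (source position, copy length, next character)


def lz77_factorize(text: str) -> List[Phrase]:
    """Greedy LZ77 parse; a source position of -1 means nothing is copied."""
    phrases: List[Phrase] = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for src in range(i):
            length = 0
            # The source may overlap the phrase being built. Stop one
            # character early so an explicit next character always exists.
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases: List[Phrase]) -> str:
    """Rebuild the text by replaying each copy character by character."""
    out: List[str] = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    parse = lz77_factorize("abababab")
    print(parse)
    print(lz77_decode(parse))
```

The more repetitive the text, the fewer phrases the parse produces, which is why the phrase count serves as a measure of compressibility for collections such as genome databases.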

LZ77-Like Compression with Fast Random Access

This work introduces an alternative Lempel-Ziv text parsing, LZ-End, that converges to the entropy and in practice gets very close to LZ77, which is ideal as a compression format for highly repetitive sequence databases, where access to individual sequences is required.

Indexes for highly repetitive document collections

We introduce new compressed inverted indexes for highly repetitive document collections. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists,

Self-Index Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes

Compressed q-Gram Indexing for Highly Repetitive Biological Sequences

This paper studies alternatives for implementing a particularly popular index, namely one able to find all the positions in the collection of substrings of fixed length ($q$-grams), and introduces two novel techniques that constitute practical alternatives for this scenario.

Indexing text using the Ziv-Lempel trie

  • G. Navarro
  • Computer Science
  • J. Discrete Algorithms
  • 2002

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. Some new structures that use run-length encoding are engineered, and empirical evidence is given that these structures are superior to the current ones.

Stronger Lempel-Ziv Based Compressed Text Indexing

Stronger Lempel-Ziv based indices (LZ-indices) are presented, improving the overall performance of the original LZ-index and achieving indices requiring $(2+\epsilon)uH_k(T)+o(u\log\sigma)$ bits of space, for any constant $\epsilon>0$, which makes them the smallest existing LZ-indices.

Storage and Retrieval of Highly Repetitive Sequence Collections

New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.

Repetition-Based Text Indexes

A new repetition-based $q$-gram index, the Lempel-Ziv index for $q$-grams, is presented; it has asymptotically optimal space requirement and query time, provided that $q$ is a constant or grows slowly enough with respect to the length of the text.
...