Flexible Indexing of Repetitive Collections

@inproceedings{Belazzougui2017FlexibleIO,
  title={Flexible Indexing of Repetitive Collections},
  author={Djamal Belazzougui and Fabio Cunial and Travis Gagie and Nicola Prezza and Mathieu Raffinot},
  booktitle={Conference on Computability in Europe},
  year={2017}
}
Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries… 
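The two ingredients named in the abstract, the runs of the Burrows-Wheeler transform and the boundaries of LZ77 factors, can be illustrated with a small sketch. This is not the paper's data structure: the function names (bwt, run_length_encode, lz77_factor_starts), the "$" sentinel, and the choice of a greedy, self-overlapping LZ77 parse are assumptions made for illustration, and the quadratic-time algorithms are only suitable for toy inputs.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    # Assumption: the sentinel does not occur in text and sorts before
    # every other character.
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse s into (character, run length) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def lz77_factor_starts(text: str) -> list[int]:
    """Start positions of the factors of a greedy LZ77 parse.

    Each factor is the longest prefix of the remaining text that also
    occurs starting at an earlier position (the source may overlap the
    factor), or a single fresh character if there is no such prefix.
    """
    starts = []
    i, n = 0, len(text)
    while i < n:
        starts.append(i)
        length = 0
        # Restricting the search window to text[0 : i + length] forces
        # any match of length + 1 characters to start before position i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)
    return starts


if __name__ == "__main__":
    text = "GATTACA" * 20  # a toy repetitive string
    r = len(run_length_encode(bwt(text)))
    z = len(lz77_factor_starts(text))
    print(f"n = {len(text)}, BWT runs r = {r}, LZ77 factors z = {z}")
```

On repetitive inputs both r and z tend to be much smaller than the text length n, which is the property that indexes of this kind exploit.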

Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings.

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.

Indexing Highly Repetitive String Collections, Part I

This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings.

Indexing Highly Repetitive String Collections

This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects.

Indexing Highly Repetitive String Collections, Part II

This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.

Resolution of the Burrows-Wheeler Transform Conjecture

This paper shows that $r=\mathcal{O}(z\log^{2}n)$ holds for every text. As a consequence, many results related to BWT automatically apply to methods based on LZ77, and the bound implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse.

Run Compressed Rank/Select for Large Alphabets

By simple reductions to the colored predecessor problem, it is shown that the query times are optimal in the important case r ≥ 2^{log^δ n}, for an arbitrary constant δ > 0.

Pangenomic Genotyping with the Marker Array

This work presents a new method and software tool called rowbowt, which applies a pangenome index to the problem of inferring genotypes from short-read sequencing data using a novel indexing structure called the marker array.

Faster and Smaller Two-Level Index for Network-Based Trajectories

This work proposes the use of a compact data structure on the bottom level of two-level indexes to handle trajectories of moving objects that are constrained to a network.

Compressed Indexes for Repetitive Textual Datasets

References

Composite Repetition-Aware Data Structures

Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.

Storage and Retrieval of Highly Repetitive Sequence Collections

New static and dynamic full-text indexes are developed that are capable of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.

CHICO: A Compressed Hybrid Index for Repetitive Collections

This paper presents an implementation of a hybrid index that combines the effectiveness of Lempel-Ziv factorization with a modular design. It is able to successfully index thousands of genomes on a commodity desktop, and it scales up to multi-terabyte collections, provided there is enough secondary memory.

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. New structures that use run-length encoding are engineered, and empirical evidence is given that these structures are superior to the current ones.

Self-indexing Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes

On compressing and indexing repetitive sequences

Self-Index Based on LZ77

We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes

Succinct Suffix Arrays based on Run-Length Encoding

A new self-index, called the RLFM index for "run-length FM-index", is presented that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)), and it is shown that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ.

LZ77-Based Self-indexing with Faster Pattern Matching

This paper shows how, given a string S[1..n] whose LZ77 parse consists of z phrases, one can store a self-index for S in \(\mathcal{O}(z \log (n / z))\) space such that later it can be extracted in time.

Stronger Lempel-Ziv Based Compressed Text Indexing

Stronger Lempel-Ziv based indices (LZ-indices) are presented, improving the overall performance of the original LZ-index and achieving indices requiring (2+ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes them the smallest existing LZ-indices.