Composite Repetition-Aware Data Structures

Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot

In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the… 
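To make "measures of repetition" concrete, the following minimal sketch computes two of them for a short string: the number of equal-letter runs in its Burrows-Wheeler transform (written r in the entries below) and the number of phrases in a greedy Lempel-Ziv 77 parse (written z). The function names, the `$` sentinel, and the particular LZ77 variant (each phrase is the longest previous match, overlaps allowed, plus one fresh character) are assumptions made for illustration and are not taken from the paper. Both routines are naive and quadratic or worse.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT via sorted rotations (illustration only).

    Assumes the sentinel is lexicographically smaller than every
    character of the text and does not occur in it.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def lz77_phrases(text: str) -> list[str]:
    """Greedy LZ77 parse (one common variant, chosen here as an assumption).

    Each phrase is the longest prefix of the remaining suffix that also
    starts at an earlier position (overlaps allowed), extended by one
    fresh character when one is available.
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        longest = 0
        for j in range(i):  # try every earlier starting position
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        end = min(n, i + longest + 1)
        phrases.append(text[i:end])
        i = end
    return phrases


if __name__ == "__main__":
    text = "abababab"
    print("r =", count_runs(bwt(text)))
    print("z =", len(lz77_phrases(text)))
```

On a string made of repeated copies of a short unit, both numbers stay small while the length grows, which is the sublinear behaviour the abstract refers to.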

Practical combinations of repetition-aware data structures

This paper explores the practical advantages of combining data structures whose size depends on distinct measures of repetition, and describes a range of variants that combine the RLBWT with the set of boundaries of the Lempel-Ziv 77 factors of a string, which take space proportional to the number of factors.

Flexible Indexing of Repetitive Collections

Practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text are described, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors.
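As a rough illustration of counting on a run-length encoded BWT, the sketch below stores the BWT as a list of (character, run length) pairs and counts pattern occurrences with the standard backward search. It is not the data structure of the paper above: the class name `RunLengthBWT` and its methods are hypothetical, the BWT is built naively, and `rank` scans all runs linearly, whereas practical indexes answer rank queries with auxiliary structures over the runs.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT via sorted rotations; assumes the sentinel is the smallest symbol."""
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


class RunLengthBWT:
    """Toy run-length encoded BWT supporting count queries (hypothetical API)."""

    def __init__(self, text: str) -> None:
        # Store the BWT as (character, run length) pairs.
        self.runs = [(ch, len(list(group))) for ch, group in groupby(bwt(text))]
        counts: dict[str, int] = {}
        for ch, length in self.runs:
            counts[ch] = counts.get(ch, 0) + length
        # C[c] = number of symbols in the text (plus sentinel) smaller than c.
        self.C: dict[str, int] = {}
        total = 0
        for ch in sorted(counts):
            self.C[ch] = total
            total += counts[ch]
        self.n = total

    def rank(self, c: str, i: int) -> int:
        """Occurrences of c among the first i BWT symbols (linear scan over runs)."""
        seen = 0
        pos = 0
        for ch, length in self.runs:
            if pos >= i:
                break
            if ch == c:
                seen += min(length, i - pos)
            pos += length
        return seen

    def count(self, pattern: str) -> int:
        """Number of occurrences of pattern, via backward search."""
        lo, hi = 0, self.n  # half-open interval of suffix ranks
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.rank(c, lo)
            hi = self.C[c] + self.rank(c, hi)
            if lo >= hi:
                return 0
        return hi - lo


if __name__ == "__main__":
    index = RunLengthBWT("abababab")
    print(index.runs)
    print(index.count("ab"), index.count("aba"), index.count("bb"))
```

The sketch only counts. Reporting the positions of the occurrences needs extra information, which is where the entry above brings in the boundaries of the Lempel-Ziv 77 factors.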

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences of a pattern efficiently (in O(occ log log n) time) within O(r) space; the resulting index outperforms the space-competitive alternatives by 1 to 2 orders of magnitude in time.

Optimal-Time Text Indexing in BWT-runs Bounded Space

This paper shows how to extend the Run-Length FM-index so that it can locate the occurrences of a pattern efficiently within O(r) space (in log-logarithmic time each), and how to reach optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits.

Practical and Flexible Indexes on Repetitive String Collections

The main goal is to develop practical and flexible succinct indexes to support pattern matching and document retrieval operations on repetitive string collections.

Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings.

Universal Compressed Text Indexing

This paper develops the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can be built on top of any dictionary-compressed text representation, and shows that the relation between indexing and compression is much deeper than what was previously thought.

Universal Compressed Text Indexing

CHICO: A Compressed Hybrid Index for Repetitive Collections

This paper presents an implementation of a hybrid index that combines the effectiveness of Lempel-Ziv factorization with a modular design. It can index thousands of genomes on a commodity desktop, and it scales up to multi-terabyte collections, provided there is enough secondary memory.



Storage and Retrieval of Highly Repetitive Sequence Collections

New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and that require space roughly proportional to the length of one typical sequence plus the total number of edit operations.

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task; some new structures that use run-length encoding are engineered, and empirical evidence is given that these structures are superior to the current ones.

On compressing and indexing repetitive sequences

The structure of subword graphs and suffix trees of Fibonacci words

Stronger Lempel-Ziv Based Compressed Text Indexing

Stronger Lempel-Ziv based indices (LZ-indices) are presented, improving the overall performance of the original LZ-index and achieving indices requiring $(2+\varepsilon)uH_k(T)+o(u\log\sigma)$ bits of space, for any constant $\varepsilon>0$, which makes them the smallest existing LZ-indices.

Lempel-Ziv parsing and sublinear-size index structures for string matching

The first sublinear-size index structure is presented; it is based on the Lempel-Ziv parsing of the text and has size linear in N, the size of the Lempel-Ziv parse.

On maximal repeats in strings

LZ77-Based Self-indexing with Faster Pattern Matching

This paper shows how, given a string S[1..n] whose LZ77 parse consists of z phrases, one can store a self-index for S in \(\mathcal{O}(z \log (n / z))\) space such that later it can be extracted in time.

Versatile Succinct Representations of the Bidirectional Burrows-Wheeler Transform

Succinct and compact representations of the bidirectional BWT of a string s ∈ Σ* are described, which provide increasing navigation power and a number of space-time tradeoffs, resulting in near-linear time algorithms for many sequence analysis problems for the first time in succinct space.