# Flexible Indexing of Repetitive Collections

@inproceedings{Belazzougui2017FlexibleIO,
title={Flexible Indexing of Repetitive Collections},
author={Djamal Belazzougui and Fabio Cunial and Travis Gagie and Nicola Prezza and Mathieu Raffinot},
booktitle={Conference on Computability in Europe},
year={2017}
}
• Published in
Conference on Computability…
12 June 2017
• Computer Science
Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such variant uses an amount of space comparable to LZ77 indexes, but it answers count queries…
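
As a rough illustration of the run-length encoded BWT mentioned in the abstract, the following Python sketch builds the BWT of a short string by sorting its rotations and then collapses the result into runs. The function names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made for this example; the quadratic construction is for illustration only and is not the construction used in the paper.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel  # hypothetical end marker, smaller than the letters used here
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse a string into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


transformed = bwt("banana")            # "annb$aa"
runs = run_length_encode(transformed)  # 5 runs for 7 characters
```

On highly repetitive inputs the number of runs tends to be much smaller than the text length, which is the property an RLBWT-based index relies on.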

### Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings.

### Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.

### Indexing Highly Repetitive String Collections

This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical aspects.

### Resolution of the Burrows-Wheeler Transform Conjecture

• Computer Science
2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)
• 2020
This paper shows that $r=\mathcal{O}(z\log^{2}n)$ holds for every text of length $n$, where $r$ is the number of runs in the BWT and $z$ is the number of LZ77 phrases. As a consequence, many results related to the BWT automatically apply to methods based on LZ77, and the bound implies the first non-trivial relation between the number of runs in the BWT of a text and in the BWT of its reverse.

### Run Compressed Rank/Select for Large Alphabets

• Computer Science
2018 Data Compression Conference
• 2018
By simple reductions to the colored predecessor problem, it is shown that the query times are optimal in the important case $r \geq 2^{\log^{\delta} n}$, for an arbitrary constant $\delta > 0$.

### Pangenomic Genotyping with the Marker Array

• Computer Science
bioRxiv
• 2022
This work presents rowbowt, a new method and software tool that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data, using a novel indexing structure called the marker array.

### Faster and Smaller Two-Level Index for Network-Based Trajectories

• Computer Science
SPIRE
• 2018
This work proposes the use of a compact data structure on the bottom level of two-level indexes to handle trajectories of moving objects that are constrained to a network.

### Compressed Indexes for Repetitive Textual Datasets

• Computer Science
Encyclopedia of Big Data Technologies
• 2019

## References


### Composite Repetition-Aware Data Structures

• Computer Science
CPM
• 2015
Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.

### Storage and Retrieval of Highly Repetitive Sequence Collections

• Biology
J. Comput. Biol.
• 2010
New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and that require space roughly proportional to the length of one typical sequence plus the total number of edit operations.

### CHICO: A Compressed Hybrid Index for Repetitive Collections

This paper presents an implementation of a hybrid index that combines the effectiveness of Lempel-Ziv factorization with a modular design. It is able to index thousands of genomes on a commodity desktop, and it scales up to multi-terabyte collections, provided there is enough secondary memory.

### Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

• Computer Science
SPIRE
• 2008
It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. Some new structures that use run-length encoding are engineered, and empirical evidence is given that these structures are superior to the current ones.

### Self-indexing Based on LZ77

• Computer Science
CPM
• 2011
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes.

### Self-Index Based on LZ77

• Computer Science
ArXiv
• 2011
We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes.

### Succinct Suffix Arrays based on Run-Length Encoding

• Computer Science
Nord. J. Comput.
• 2005
A new self-index, called the RLFM index for "run-length FM-index", is presented that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)), and it is shown that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ.
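
The counting step of an FM-index-style structure can be sketched with backward search over the BWT. The Python below is a minimal sketch under stated assumptions: `count_occurrences` is a hypothetical name, the input is the BWT of the text with a `$` sentinel appended, and the `rank` helper is a linear scan standing in for the constant-time rank structure that the run-length index would provide, so this version does not achieve O(m) time.

```python
from collections import Counter


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern in the text whose BWT is bwt_str."""
    # C[c] = number of characters in the text that are strictly smaller than c.
    counts = Counter(bwt_str)
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(ch, i):
        # Occurrences of ch in bwt_str[:i]; a real index answers this in
        # constant or near-constant time from a compressed structure.
        return bwt_str.count(ch, 0, i)

    lo, hi = 0, len(bwt_str)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + rank(ch, lo)
        hi = C[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


# "annb$aa" is the BWT of "banana" followed by the sentinel "$".
matches = count_occurrences("annb$aa", "ana")  # 2
```

The pattern is processed right to left, and each step narrows the range of suffixes that start with the part of the pattern read so far, so the loop runs once per pattern character.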

### LZ77-Based Self-indexing with Faster Pattern Matching

• Computer Science
LATIN
• 2014
This paper shows how, given a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, one can store a self-index for $S$ in $\mathcal{O}(z \log (n / z))$ space such that later it can be extracted in time.
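
To make the notion of an LZ77 parse with z phrases concrete, here is a minimal Python sketch. It assumes one common variant of the parse, in which each phrase is either the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps with the phrase itself allowed) or a single new character. The name `lz77_parse` is hypothetical, and the repeated substring searches make this far slower than the algorithms used in practice.

```python
def lz77_parse(s):
    """Greedy LZ77-style factorization; returns the list of phrases."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        length = 0
        # Extend while s[i:i+length+1] has an occurrence that starts before i.
        # Limiting the search to s[:i+length] forces the start to be < i.
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        step = max(length, 1)  # a character not seen before forms its own phrase
        phrases.append(s[i:i + step])
        i += step
    return phrases


phrases = lz77_parse("banana")  # ["b", "a", "n", "ana"]
z = len(phrases)                # 4
```

Concatenating the phrases gives back the input, and for repetitive strings z grows much more slowly than n.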

### Stronger Lempel-Ziv Based Compressed Text Indexing

• Computer Science
Algorithmica
• 2010
Stronger Lempel-Ziv based indices (LZ-indices) are presented, improving the overall performance of the original LZ-index and achieving indices requiring $(2+\varepsilon)uH_k(T)+o(u\log\sigma)$ bits of space, for any constant $\varepsilon>0$, which makes them the smallest existing LZ-indices.