• Corpus ID: 227125453

Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

Gonzalo Navarro

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly… 

Indexing Highly Repetitive String Collections, Part II

This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.

Subpath Queries on Compressed Graphs: a Survey

This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.

Document Retrieval Hacks

This paper describes simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured.

A New Class of String Transformations for Compressed Text Indexing

This paper introduces a whole class of new string transformations, called local orderings-based transformations, which have all the “myriad virtues” of BWT, and shows that this new family is a special case of a much larger class of transformations, based on context adaptive alphabet orderings, that includes BWT and ABWT.

Accelerating computation on compressed data via Context-Free Grammars

This research will study an efficient method of grammar compression constructed through locally consistent parsing, and research a grammar-based compression method to improve algorithmic efficiency on abstract data types.

Towards a Definitive Compressibility Measure for Repetitive Sequences

This paper argues that δ better captures the compressibility of repetitive strings, and studies an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space.
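As an illustration of the measure mentioned above, δ is commonly defined as the maximum over k of d_k/k, where d_k is the number of distinct length-k substrings of the string. The sketch below computes it naively; the function name `delta` is hypothetical, and this brute-force version does not run in the linear time the summary refers to.

```python
def delta(s: str) -> float:
    """Naive substring-complexity measure: max over k of d_k / k.

    d_k is the number of distinct substrings of length k in s.
    Brute force for illustration only; this is NOT the linear-time method.
    """
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        # Collect every distinct substring of length k.
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best


if __name__ == "__main__":
    for text in ["aaaa", "abab", "abcd"]:
        print(text, delta(text))
```

A string with a single repeated letter has only one distinct substring of each length, so its δ stays at 1 regardless of length, whereas a string of all-distinct letters has δ equal to its length.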

Efficient Construction of the BWT for Repetitive Text Using String Compression

This paper presents a new semi-external algorithm that builds the Burrows–Wheeler transform variant of Bauer et al. (a.k.a. BCR BWT) in linear expected time, using compression techniques to reduce the computational costs when the input is massive and repetitive.

Breaking the O(n)-Barrier in the Construction of Compressed Suffix Arrays

A long-standing barrier is broken with a new data structure that takes O(n log σ) bits, answers suffix array queries in O(log^ε n) time, and can be constructed in O(n log σ / √(log n)) time using O(n log σ) bits of space.

A theoretical and experimental analysis of BWT variants for string collections

It is found that the differences between these BWT variants can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences.

New approaches to compressibility and repetitiveness

This thesis aims to improve the current understanding of state-of-the-art measures like W, 1 (the smallest reachable measure to date) and X (considered a stable lower bound for repetitiveness), and to introduce new measures of repetitiveness that achieve better compression rates than those currently studied.



Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.

Universal indexes for highly repetitive document collections

Storage and Retrieval of Highly Repetitive Sequence Collections

New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.

Universal Compressed Text Indexing

Compressed Computation for Text Indexing

This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index gathering the best features of the above-discussed indexes: fast queries, and strong compression rates (up to exponential compression can be achieved).

Composite Repetition-Aware Data Structures

Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.

Flexible Indexing of Repetitive Collections

This paper describes practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors.

Compressed full-text indexes

The relationship between text entropy and the regularities that show up in index structures and permit compressing them is explained, and the most relevant self-indexes are covered, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems.

Random Access in Persistent Strings

This work shows how to represent the corresponding collection in O(n) space and optimal $O(\log n/ \log \log n)$ query time, which improves the previous time-space trade-offs for the problem.

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space, and outperforms the space-competitive alternatives by one to two orders of magnitude in time.
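To make the O(r) bound above concrete: r is the number of equal-letter runs in the Burrows-Wheeler transform of the text. The following is a minimal sketch that builds the BWT by sorting rotations and counts its runs. The function names `bwt` and `count_runs` are hypothetical, the `$` sentinel is assumed to be smaller than every character of the text, and sorting rotations is far slower than the constructions used by real indexes.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustrative only).

    Assumes `sentinel` does not occur in `text` and sorts before all of
    its characters.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, count_runs(transformed))
```

On highly repetitive collections the BWT tends to group identical characters together, so r can be much smaller than the text length, which is what makes space bounds expressed in r attractive.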