Universal Compressed Text Indexing

Gonzalo Navarro and Nicola Prezza. Theor. Comput. Sci.

Practical and Flexible Indexes on Repetitive String Collections

The main goal is to develop practical and flexible succinct indexes to support pattern matching and document retrieval operations on repetitive string collections.

On Locating Paths in Compressed Cardinal Trees

This paper shows for the first time how to support the powerful locate queries on compressed trees, and proposes suitable generalizations of run-length BWT, high-order entropy, and string attractors to cardinal trees (tries).

Faster Queries on BWT-runs Compressed Indexes

This paper presents r-index-f, a new compressed index on the RLBWT that improves the r-index for faster locate queries, along with a novel backward search algorithm on balanced BWT-sequences.

Subpath Queries on Compressed Graphs: a Survey

This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.

Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures

This survey describes the distinct compression paradigms that have been used to exploit repetitiveness, and the algorithmic techniques that provide direct access to the compressed strings.

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.

Block Tree based Universal Self-Index for Repetitive Text Collections

Manipulating text within compressed space, with compression that reflects its repetitiveness, is critically important in many areas of study such as Bioinformatics, Information Retrieval, and Data Mining.

Towards a Definitive Compressibility Measure for Repetitive Sequences

This paper argues that δ better captures the compressibility of repetitive strings, and studies an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space.

Optimal-Time Queries on BWT-Runs Compressed Indexes

This paper presents the first compressed index on the RLBWT, called r-index-f, that supports various queries, including locate, count, and extract queries, decompression, and prefix search, in optimal time with a smaller working space of $O(r)$ words for small alphabets.



Universal indexes for highly repetitive document collections

On compressing and indexing repetitive sequences

At the roots of dictionary compression: string attractors

This paper provides matching lower and upper bounds for the random access problem on string attractors, and shows that the k-attractor problem (deciding whether a text has a size-t set of positions capturing all substrings of length at most k) is NP-complete for k ≥ 3, including the full string attractor problem.

Self-Indexed Grammar-Based Compression

The first grammar-based self-index is introduced, a representation of SLPs that takes 2n log₂ n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules and a representation for binary relations with labels supporting various extended queries.

Optimal-Time Text Indexing in BWT-runs Bounded Space

This paper shows how to extend the Run-Length FM-index so that it can locate the occurrences of a pattern efficiently within O(r) space (in log-logarithmic time each), and how to reach optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits.

Composite Repetition-Aware Data Structures

Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.

Collage system: a unifying framework for compressed pattern matching

Sparse Suffix Tree Construction in Optimal Time and Space

A linear-time Monte Carlo algorithm is designed for sparse suffix tree construction, and this algorithm is complemented with a deterministic verification procedure that improves upon the bound of O(n log b) obtained by I et al.

Data compression via textual substitution

A general model for data compression that includes most data compression systems in the literature as special cases is presented, and trade-offs between different varieties of macro schemes, exact lower bounds on the amount of compression obtainable, and the complexity of encoding and decoding are discussed.

Time-space trade-offs for Lempel-Ziv compressed indexing