Succinct Suffix Arrays based on Run-Length Encoding

Veli Mäkinen and Gonzalo Navarro. Nord. J. Comput.
A succinct full-text self-index is a data structure built on a text T, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. Many of those take space proportional to nH_0 or nH_k bits, where H_k is the kth order empirical entropy of T. The…

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space, and outperforms the space-competitive alternatives by 1--2 orders of magnitude in time.

Stronger Lempel-Ziv Based Compressed Text Indexing

Stronger Lempel-Ziv based indices (LZ-indices) are presented, improving the overall performance of the original LZ-index and achieving indices requiring (2+ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes them the smallest existing LZ-indices.

Reducing the Space Requirement of LZ-Index

Two different approaches to reduce the space requirement of LZ-index are presented, and it is shown how the space can be squeezed to (1 + ε)uH_k(T) + o(u log σ) to obtain a structure with $O(m^2)$ average search time for $m \geqslant 2\log_\sigma{u}$.

Space-efficient construction of Lempel-Ziv compressed text indexes

Space-Efficient Construction of LZ-Index

This paper presents a practical space-efficient algorithm to construct LZ-index, requiring (4+ε)uH_k + o(u) bits of space, for any constant 0 < ε < 1, and O(σu) time, where σ is the alphabet size.

Minimal Absent Words on Run-Length Encoded Strings

This paper focuses on the most basic compressed representation of a string, run-length encoding (RLE), which represents each maximal run of the same character a by a^p, where p is the length of the run.
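
The run-length encoding described above can be illustrated with a minimal Python sketch. The function names `rle_encode` and `rle_decode` are hypothetical and chosen only for illustration; they are not taken from the paper. Each maximal run a^p is represented here as a pair (a, p).

```python
from itertools import groupby


def rle_encode(s):
    """Return one (character, run_length) pair per maximal run of s."""
    # groupby yields consecutive groups of equal characters, i.e. maximal runs.
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(s)]


def rle_decode(runs):
    """Rebuild the original string from (character, run_length) pairs."""
    return "".join(ch * p for ch, p in runs)
```

For example, `rle_encode("aaabbc")` gives `[("a", 3), ("b", 2), ("c", 1)]`, which corresponds to a^3 b^2 c^1, and `rle_decode` inverts it.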

Ziv-Lempel Compressed Full-Text Self-Indexes

This thesis proposes a deep study of compressed full-text self-indexes based on the Ziv-Lempel compression algorithm, focusing on Navarro's LZ-index, which has many interesting properties: fast full-text searching and text recovery; using little space for construction and operation; allowing insertion and deletion of text; providing a range of space/time trade-offs; and efficient construction and search in secondary memory.

Optimal-Time Text Indexing in BWT-runs Bounded Space

This paper shows how to extend the Run-Length FM-index so that it can locate the occurrences of a pattern efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits.
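
As a rough illustration of the quantity r in the bounds above, the sketch below assumes that r denotes the number of maximal equal-letter runs in the Burrows-Wheeler transform (BWT) of the text. It uses a naive sorted-rotations construction with an appended sentinel `$`, which is an assumption made for brevity and is not the construction used in the paper. The names `bwt` and `count_runs` are hypothetical.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, take the last column."""
    # Assumption: the sentinel does not occur in text and sorts before every
    # other character (true for "$" versus ASCII letters).
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Count maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])
```

For the text `banana`, `bwt("banana")` is `annb$aa`, which has 5 runs. On highly repetitive texts the number of runs tends to be much smaller than the text length, which is what makes space bounds stated in terms of r attractive.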

Engineering Fully-Compressed Suffix Trees

This work proposes a variant of the FCST that improves pattern matching both in theory and in practice using a blind search approach and shows that the implementation outperforms the previous prototype in both space consumption and query/construction time.

Run-Length Compressed Indexes for Repetitive Sequence Collections

New static/dynamic full-text self-indexes based on run-length encoding, whose space requirements are much less dependent on N, are developed, and can be plugged into a recent dynamic fully-compressed suffix tree using an additional O((N/δ) log N) bits of space for any δ = polylog(N), and retaining the polylog(N) time slowdown on operations.



First Huffman, Then Burrows-Wheeler: A Simple Alphabet-Independent FM-Index

The main problem of the FM-index is that its space usage depends exponentially on σ, that is, 5H_k n + σ^σ o(n) for any k, H_k being the k-th order entropy of T.

Time-space trade-offs for compressed suffix arrays

  • S. S. Rao
  • Computer Science
    Inf. Process. Lett.
  • 2002

Run-Length FM-index

It is shown how the ideas of the FM-index can be used to obtain an index needing O(H_k n) bits of space, with the constant factor depending only logarithmically on σ.

Compressed Compact Suffix Arrays

It is shown that the occ occurrence positions of a pattern of length m in a text of length n can be reported in O((m + occ) log n) time using the CCSA, whose representation needs O(n(1 + H_k log n)) bits for any k, H_k being the k-th order empirical entropy of the text.

Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)

An index structure is constructed that occupies only O(n) bits, compares favorably with inverted lists in space, and achieves optimal O(m/log n) search time for sufficiently large m = Ω(log^{3+ε} n).

Advantages of Backward Searching - Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays

The most remarkable one is that the CSA does not need any complicated sub-linear structures based on the four-Russians technique, and it is shown that sampling and compression are enough to achieve O(m log n) query time using less space than the original structure.

Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array

A compressed text database based on the compressed suffix array is proposed, and the relationship with the opportunistic data structure of Ferragina and Manzini is shown.

Space Efficient Suffix Trees

This work gives a representation of a suffix tree that uses \(n \lg n + O(n)\) bits of space and supports searching for a pattern in the given text in O(m) time and develops a structure that uses a suffix array and an additional o(n) bits.

High-order entropy-compressed text indexes

We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ,

Succinct representations of lcp information and improvements in the compressed suffix arrays

Two succinct data structures are introduced for storing the information of lcp, the longest common prefix, between suffixes in the suffix array, along with an improvement to the compressed suffix array that supports linear-time counting queries for any pattern.