Optimal-Time Text Indexing in BWT-runs Bounded Space

@inproceedings{Gagie2017OptimalTimeTI,
  title={Optimal-Time Text Indexing in BWT-runs Bounded Space},
  author={Travis Gagie and Gonzalo Navarro and Nicola Prezza},
  booktitle={ACM-SIAM Symposium on Discrete Algorithms},
  year={2017}
}
Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m… 
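
As a rough illustration of the measure $r$ mentioned in the abstract, the following Python sketch computes a BWT by sorting all rotations of the text and then counts its equal-letter runs. It is a toy under stated assumptions, not the construction used in the paper: the function names `bwt` and `count_runs` are hypothetical, the sentinel `$` is assumed to be absent from the text and smaller than every other character, and sorting rotations takes quadratic space, so it only suits small inputs.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Sort every rotation of s; the BWT is the last column of the sorted matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal equal-letter runs in s (this is r when s is a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


# Small worked case: bwt("banana") == "annb$aa", which has 5 runs.
example = bwt("banana")
r_example = count_runs(example)

# A repetitive text: r stays far below the length n of the transformed text.
repetitive = bwt("GATTACA" * 20)
n, r = len(repetitive), count_runs(repetitive)
```

On the repetitive input, `r` is much smaller than `n`, which is the property that indexes using $O(r)$ space exploit.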

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space, and outperforms the space-competitive alternatives by 1--2 orders of magnitude in time.

Bi-Directional r-Indexes

The br-index is proposed, which supports extending the matched pattern both in forward and backward directions, and locating the occurrences of the pattern at any step of the search, within $O(r + r_R)$ words of space, where $r_R$ is the number of equal-letter runs in the BWT of the reversed text.

Practical Indexing of Repetitive Collections Using Relative Lempel-Ziv

This work introduces a simple and implementable compressed index for highly repetitive sequence collections based on Relative Lempel-Ziv (RLZ), which achieves the least space among competing structures while outperforming or matching them in time.

FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns

This paper combines the virtues of a grammar with the RLBWT by building the RLBWT on top of a special grammar based on induced suffix sorting, and reveals that the hybrid approach outperforms the classic RLBWT with respect to index size and query time on biological data sets for sufficiently long patterns, which could be of interest for aligning long reads in bioinformatics.

Optimal Construction of Compressed Indexes for Highly Repetitive Texts

Algorithms that construct the Burrows-Wheeler transform, the permuted longest-common-prefix array, and the LZ77 parsing in $O(n/\log_\sigma n + r \operatorname{polylog} n)$ time and working space are proposed, where $r$ is the number of runs in the BWT of the input.

Computing the optimal BWT of very large string collections

This paper presents the first tool that guarantees a Burrows-Wheeler Transform with a minimum number of runs (optBWT), and presents results on both real-life and simulated data, showing that the improvement in $r$ with respect to the input order is significant and that the overhead of computing the optimal BWT is negligible, making the tool competitive with other tools for BWT computation in terms of running time and space usage.

Resolution of the Burrows-Wheeler Transform Conjecture

This paper shows that $r=\mathcal{O}(z\log^{2}n)$ holds for every text, and proves that many results related to BWT automatically apply to methods based on LZ77, and implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse.

Fast, Small, and Simple Document Listing on Repetitive Text Collections

A simple document listing index is presented for repetitive string collections of total length n; it lists the distinct documents where a pattern of length m appears, in time that sharply outperforms existing alternatives in the space/time tradeoff map.

A theoretical and experimental analysis of BWT variants for string collections

It is found that the differences between these BWT variants can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences.

Universal Compressed Text Indexing

This paper develops the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can be built on top of any dictionary-compressed text representation, and shows that the relation between indexing and compression is much deeper than what was previously thought.

References

Composite Repetition-Aware Data Structures

Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.

Compressed Computation for Text Indexing

This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index gathering the best features of the above-discussed indexes: fast queries, and strong compression rates (up to exponential compression can be achieved).

Self-Indexed Grammar-Based Compression

The first grammar-based self-index is introduced, a representation of SLPs that takes $2n \log_2 n (1 + o(1))$ bits and efficiently supports more operations than a plain array of rules and a representation for binary relations with labels supporting various extended queries.

Fully Dynamic Data Structure for LCE Queries in Compressed Space

The signature encoding $\mathcal{G}$ of $T$ supports LCE queries in $O(\log N + \log \ell \log^* M)$ time, and it is shown that this is the first fully dynamic LCE data structure.

Improved Grammar-Based Compressed Indexes

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is

Compressed Text Indexes with Fast Locate

This paper introduces a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes, and permits a very efficient secondary memory implementation, where compression permits reducing the amount of I/O needed to answer queries.

Storage and Retrieval of Individual Genomes

The structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

Universal indexes for highly repetitive document collections

Storage and Retrieval of Highly Repetitive Sequence Collections

New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.

Time-space trade-offs for Lempel-Ziv compressed indexing
