# Optimal-Time Text Indexing in BWT-runs Bounded Space

@inproceedings{Gagie2017OptimalTimeTI,
title={Optimal-Time Text Indexing in BWT-runs Bounded Space},
author={Travis Gagie and Gonzalo Navarro and Nicola Prezza},
booktitle={ACM-SIAM Symposium on Discrete Algorithms},
year={2017}
}
• Published in
ACM-SIAM Symposium on…
29 May 2017
• Computer Science
Indexing highly repetitive texts (such as genomic databases, software repositories and versioned text collections) has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length m…

## 89 Citations

• Computer Science
J. ACM
• 2020
This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space, and outperforms the space-competitive alternatives by 1--2 orders of magnitude in time.
• Computer Science
CPM
• 2022
The br-index is proposed, which supports extending the matched pattern in both the forward and backward directions, and locating the occurrences of the pattern at any step of the search, within $O(r + r_R)$ words of space, where $r_R$ is the number of equal-letter runs in the BWT of the reversed text.
• Computer Science
2019 Data Compression Conference (DCC)
• 2019
This work introduces a simple and implementable compressed index for highly repetitive sequence collections based on Relative Lempel-Ziv (RLZ), which achieves the least space among competing structures while outperforming or matching them in time.
• Computer Science
2022 Data Compression Conference (DCC)
• 2022
This paper combines the virtues of a grammar with the RLBWT by building the RLBWT on top of a special grammar based on induced suffix sorting, and reveals that the hybrid approach outperforms the classic RLBWT with respect to index sizes and query times on biological data sets for sufficiently long patterns, which could be interesting for aligning long reads in bioinformatics.
Algorithms that construct the Burrows-Wheeler transform, the permuted longest-common-prefix array, and the LZ77 parsing in $O(n/\log_\sigma n + r\,\mathrm{polylog}\,n)$ time and working space are proposed, where $r$ is the number of runs in the BWT of the input.
• Computer Science
ArXiv
• 2022
This paper presents the first tool that guarantees a Burrows-Wheeler Transform with the minimum number of runs (optBWT), and presents results on both real-life and simulated data, showing that the improvement achieved in terms of $r$ with respect to the input order is significant and the overhead created by the computation of the optimal BWT is negligible, making the tool competitive with other tools for BWT computation in terms of running time and space usage.
• Computer Science
2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)
• 2020
This paper shows that $r=\mathcal{O}(z\log^{2}n)$ holds for every text, proves that many results related to the BWT automatically apply to methods based on LZ77, and implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse.
• Computer Science
SPIRE
• 2019
A simple document listing index for repetitive string collections of total length n that lists the distinct documents where a pattern of length m appears in time that sharply outperforms existing alternatives in the space/time tradeoff map.
• Computer Science
CPM
• 2022
It is found that the differences between these BWT variants can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences.
• Computer Science
• 2018
This paper develops the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can be built on top of any dictionary-compressed text representation, and shows that the relation between indexing and compression is much deeper than what was previously thought.
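
To make the measure $r$ from the abstract concrete, the following is a minimal sketch that computes the BWT of a short string and counts its equal-letter runs. It uses the naive construction that sorts all rotations, which takes quadratic space and is only suitable for toy inputs; it is not the construction used by the paper or by any of the works listed above. The function names `bwt` and `count_runs`, and the choice of `$` as a terminator assumed to be absent from the text and smaller than every other character, are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler Transform of text + sentinel.

    Naive method: sort every rotation and read off the last column.
    Assumes the sentinel does not occur in text and sorts before
    every character of text.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Return the number of maximal runs of equal letters in s."""
    if not s:
        return 0
    # A new run starts at every position whose letter differs from the previous one.
    return 1 + sum(1 for prev, cur in zip(s, s[1:]) if prev != cur)


if __name__ == "__main__":
    for text in ("banana", "ab" * 8):
        transformed = bwt(text)
        print(text, transformed, count_runs(transformed))
```

On a repetitive input such as `"ab" * 8`, the number of runs stays small while the text length grows, which is the behaviour that makes $r$ useful as a compressibility measure for repetitive collections.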

## References


• Computer Science
CPM
• 2015
Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.
This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index gathering the best features of the above-discussed indexes: fast queries, and strong compression rates (up to exponential compression can be achieved).
• Computer Science
Fundam. Informaticae
• 2011
The first grammar-based self-index is introduced, together with a representation of SLPs that takes $2n \log_2 n\,(1 + o(1))$ bits and efficiently supports more operations than a plain array of rules, and a representation for binary relations with labels supporting various extended queries.
• Computer Science
MFCS
• 2016
The signature encoding $\mathcal{G}$ of $T$ can support LCE queries in $O(\log N + \log \ell \log^* M)$ time, and it is shown that this is the first fully dynamic LCE data structure.
• Computer Science
SPIRE
• 2012
We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is
• Computer Science
CPM
• 2007
This paper introduces a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes, and permits a very efficient secondary memory implementation, where compression permits reducing the amount of I/O needed to answer queries.
• Computer Science
RECOMB
• 2009
The structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
• Biology
J. Comput. Biol.
• 2010
New static and dynamic full-text indexes are developed that are capable of capturing the fact that a collection is highly repetitive, and that require space basically proportional to the length of one typical sequence plus the total number of edit operations.