Optimal-Time Text Indexing in BWT-runs Bounded Space
@inproceedings{Gagie2017OptimalTimeTI, title={Optimal-Time Text Indexing in BWT-runs Bounded Space}, author={Travis Gagie and Gonzalo Navarro and Nicola Prezza}, booktitle={ACM-SIAM Symposium on Discrete Algorithms}, year={2017} }
Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m…
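To make the measure $r$ concrete, the following is a minimal sketch, not the paper's construction: it builds the BWT by sorting all rotations (quadratic time, for illustration only) and counts the maximal equal-letter runs. The function names `bwt` and `count_runs` and the `$` sentinel are assumptions made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text via sorted rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    # Assumption: the sentinel sorts before every character of the text.
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Return the number of maximal equal-letter runs in s (the measure r)."""
    if not s:
        return 0
    # A new run starts wherever two adjacent characters differ.
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    for text in ("banana", "abcd" * 8):
        transformed = bwt(text)
        print(len(text), transformed, count_runs(transformed))
```

For `banana` the transform is `annb$aa`, which has 5 runs. The 32-character text `abcd` repeated eight times also yields 5 runs, which illustrates why $r$ can stay small as a repetitive text grows.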
89 Citations
Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space
- Computer Science, J. ACM
- 2020
This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space, and outperforms the space-competitive alternatives by 1--2 orders of magnitude in time.
Bi-Directional r-Indexes
- Computer Science, CPM
- 2022
The br-index is proposed, which supports extending the matched pattern both in forward and backward directions, and locating the occurrences of the pattern at any step of the search, within $O(r + r_R)$ words of space, where $r_R$ is the number of equal-letter runs in the BWT of the reversed text.
Practical Indexing of Repetitive Collections Using Relative Lempel-Ziv
- Computer Science, 2019 Data Compression Conference (DCC)
- 2019
This work introduces a simple and implementable compressed index for highly repetitive sequence collections based on Relative Lempel-Ziv (RLZ), which achieves the least space among competing structures while outperforming or matching them in time.
FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns
- Computer Science, 2022 Data Compression Conference (DCC)
- 2022
This paper combines the virtues of a grammar with the RLBWT by building the RLBWT on top of a special grammar based on induced suffix sorting, and reveals that the hybrid approach outperforms the classic RLBWT with respect to index sizes and query times on biological data sets for sufficiently long patterns, which could be interesting for aligning long reads in bioinformatics.
Optimal Construction of Compressed Indexes for Highly Repetitive Texts
- Computer Science, SODA
- 2019
Algorithms that construct the Burrows-Wheeler transform, the permuted longest-common-prefix array, and the LZ77 parsing in $O(n/\log_\sigma n + r \operatorname{polylog} n)$ time and working space are proposed, where $r$ is the number of runs in the BWT of the input.
Computing the optimal BWT of very large string collections
- Computer Science, ArXiv
- 2022
This paper presents the first tool that guarantees a Burrows-Wheeler Transform with the minimum number of runs (optBWT), and presents results on both real-life and simulated data, showing that the improvement achieved in terms of r with respect to the input order is significant and the overhead created by the computation of the optimal BWT is negligible, making the tool competitive with other tools for BWT computation in terms of running time and space usage.
Resolution of the Burrows-Wheeler Transform Conjecture
- Computer Science, 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)
- 2020
This paper shows that $r=\mathcal{O}(z\log^{2}n)$ holds for every text, and proves that many results related to BWT automatically apply to methods based on LZ77, and implies the first non-trivial relation between the number of runs in the BWT of the text and its reverse.
Fast, Small, and Simple Document Listing on Repetitive Text Collections
- Computer Science, SPIRE
- 2019
A simple document listing index is presented for repetitive string collections of total length n; it lists the distinct documents where a pattern of length m appears, in time that sharply outperforms existing alternatives in the space/time tradeoff map.
A theoretical and experimental analysis of BWT variants for string collections
- Computer Science, CPM
- 2022
It is found that the differences between these BWT variants can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences.
Universal Compressed Text Indexing
- Computer Science
- 2018
This paper develops the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can be built on top of any dictionary-compressed text representation, and shows that the relation between indexing and compression is much deeper than what was previously thought.
References
Composite Repetition-Aware Data Structures
- Computer Science, CPM
- 2015
Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.
Compressed Computation for Text Indexing
- Computer Science
- 2017
This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index gathering the best features of the above-discussed indexes: fast queries, and strong compression rates (up to exponential compression can be achieved).
Self-Indexed Grammar-Based Compression
- Computer Science, Fundam. Informaticae
- 2011
The first grammar-based self-index is introduced, a representation of SLPs that takes $2n \log_2 n\,(1 + o(1))$ bits and efficiently supports more operations than a plain array of rules and a representation for binary relations with labels supporting various extended queries.
Fully Dynamic Data Structure for LCE Queries in Compressed Space
- Computer Science, MFCS
- 2016
The signature encoding $\mathcal{G}$ of a text $T$ is shown to support LCE queries in $O(\log N + \log \ell \log^* M)$ time, and it is shown that this is the first fully dynamic LCE data structure.
Improved Grammar-Based Compressed Indexes
- Computer Science, SPIRE
- 2012
We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is…
Compressed Text Indexes with Fast Locate
- Computer Science, CPM
- 2007
This paper introduces a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes, and permits a very efficient secondary memory implementation, where compression permits reducing the amount of I/O needed to answer queries.
Storage and Retrieval of Individual Genomes
- Computer Science, RECOMB
- 2009
The structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
Storage and Retrieval of Highly Repetitive Sequence Collections
- Biology, J. Comput. Biol.
- 2010
New static and dynamic full-text indexes are developed that are capable of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.
Time-space trade-offs for Lempel-Ziv compressed indexing
- Computer Science, Theor. Comput. Sci.
- 2017