Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures
@inproceedings{Navarro2020IndexingHR, title={Indexing Highly Repetitive String Collections, Part I: Repetitiveness Measures}, author={Gonzalo Navarro}, year={2020} }
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore’s Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly…
34 Citations
Indexing Highly Repetitive String Collections, Part II
- Computer Science
- ACM Comput. Surv.
- 2022
This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.
Subpath Queries on Compressed Graphs: a Survey
- Computer Science
- Algorithms
- 2021
This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today’s compressed indexes for labeled graphs and regular languages.
Document Retrieval Hacks
- Computer Science
- SEA
- 2021
This paper describes simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured.
A New Class of String Transformations for Compressed Text Indexing
- Computer Science
- ArXiv
- 2022
This paper introduces a whole class of new string transformations, called local orderings-based transformations, which have all the “myriad virtues” of BWT, and shows that this new family is a special case of a much larger class of transformations, based on context adaptive alphabet orderings, that includes BWT and ABWT.
Accelerating computation on compressed data via Context-Free Grammars
- Computer Science
- 2021
This research will study an efficient method of grammar compression constructed through locally consistent parsing, and research a grammar-based compression method to improve algorithmic efficiency on abstract data types.
Towards a Definitive Compressibility Measure for Repetitive Sequences
- Computer Science
- IEEE Transactions on Information Theory
- 2022
This paper argues that δ better captures the compressibility of repetitive strings, and studies an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space.
Efficient Construction of the BWT for Repetitive Text Using String Compression
- Computer Science
- CPM
- 2022
A new semi-external algorithm is presented that builds the Burrows–Wheeler transform variant of Bauer et al. (a.k.a. BCR BWT) in linear expected time, using compression techniques to reduce the computational costs when the input is massive and repetitive.
Breaking the O(n)-Barrier in the Construction of Compressed Suffix Arrays
- Computer Science
- ArXiv
- 2021
A long-standing barrier is broken with a new data structure that takes O(n log σ) bits, answers suffix array queries in O(log^ε n) time, and can be constructed in O(n log σ / √(log n)) time using O(n log σ) bits of space.
A theoretical and experimental analysis of BWT variants for string collections
- Computer Science
- CPM
- 2022
It is found that the differences between these BWT variants can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences.
New approaches to compressibility and repetitiveness
- Computer Science
- 2021
This thesis aims to improve the current understanding of state-of-the-art measures like W, 1 (the smallest reachable measure to date) and X (considered a stable lower bound for repetitiveness), and to introduce new measures of repetitiveness that achieve better rates of compression than those currently studied.
References
Indexing Highly Repetitive String Collections, Part II: Compressed Indexes
- Computer Science
- 2020
This survey covers the fundamental algorithmic ideas and data structures that form the base of all the existing indexes, and the various concrete structures that have been proposed, comparing them both in theoretical and practical aspects, and uncovering some new combinations.
Storage and Retrieval of Highly Repetitive Sequence Collections
- Biology
- J. Comput. Biol.
- 2010
New static and dynamic full-text indexes are developed that are capable of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.
Compressed Computation for Text Indexing
- Computer Science
- 2017
This thesis deals with space-efficient algorithms to compress and index texts and shows that these two tools can be combined in a single index gathering the best features of the above-discussed indexes: fast queries, and strong compression rates (up to exponential compression can be achieved).
Composite Repetition-Aware Data Structures
- Computer Science
- CPM
- 2015
Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.
Flexible Indexing of Repetitive Collections
- Computer Science
- CiE
- 2017
Practical data structures that support counting and locating all the exact occurrences of a pattern in a repetitive text are described, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors.
Compressed full-text indexes
- Computer Science
- CSUR
- 2007
The relationship between text entropy and the regularities that show up in index structures and permit compressing them is explained, and the most relevant self-indexes are covered, focusing on how they exploit text compressibility to achieve compact structures that can efficiently solve various search problems.
Random Access in Persistent Strings
- Computer Science
- ISAAC
- 2020
This work shows how to represent the corresponding collection in O(n) space and optimal $O(\log n/ \log \log n)$ query time, which improves the previous time-space trade-offs for the problem.
Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space
- Computer Science
- J. ACM
- 2020
This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space, and outperforms the space-competitive alternatives by 1-2 orders of magnitude in time.