Fast Label Extraction in the CDAWG
@article{Belazzougui2017FastLE, title={Fast Label Extraction in the CDAWG}, author={Djamal Belazzougui and Fabio Cunial}, journal={ArXiv}, year={2017}, volume={abs/1707.08197} }
The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data…
21 Citations
Online Algorithms for Constructing Linear-size Suffix Trie
- Computer ScienceCPM
- 2019
Two types of online algorithms are presented that 'directly' construct the LST, one from right to left and one from left to right, without constructing the suffix tree as an intermediate structure.
Fully-functional bidirectional Burrows-Wheeler indexes
- Computer ScienceArXiv
- 2019
An index that supports bidirectional addition and removal in $O(\log{\log{|T|}})$ time, and that occupies a number of words proportional to the number of left and right extensions of the maximal repeats of $T$.
Optimal-Time Text Indexing in BWT-runs Bounded Space
- Computer ScienceSODA
- 2018
This paper shows how to extend the Run-Length FM-index so that it can locate the occurrences of a pattern efficiently within $O(r)$ space (in loglogarithmic time each), reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits.
Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space
- Computer ScienceJ. ACM
- 2020
This article shows how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently (in $O(occ \log\log n)$ time) within $O(r)$ space, and outperforms the space-competitive alternatives by 1--2 orders of magnitude in time.
Towards a Definitive Measure of Repetitiveness
- Computer Science, MathematicsLATIN
- 2020
A smaller measure, $\delta$, is studied, which can be computed in linear time and captures better the concept of compressibility in repetitive strings, and it is proved that, for some string families, it holds $\gamma = \Omega(\delta \log n)$.
Suffix Trees, DAWGs and CDAWGs for Forward and Backward Tries
- Computer ScienceLATIN
- 2020
A full perspective on the sizes of indexing structures such as suffix trees, DAWGs, and CDAWGs for forward and backward tries is shown.
Towards a Definitive Compressibility Measure for Repetitive Sequences
- Computer ScienceIEEE Transactions on Information Theory
- 2022
This paper argues that δ better captures the compressibility of repetitive strings, and studies an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space.
The colored longest common prefix array computed via sequential scans
- Computer ScienceSPIRE
- 2018
This paper proposes an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances.
Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs
- Computer ScienceCPM
- 2019
An index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T.
Block Tree based Universal Self-Index for Repetitive Text Collections
- Computer Science
- 2020
Being able to manipulate the text within compressed space, with compression related to its repetitiveness, is critically important in many areas of study, such as Bioinformatics, Information Retrieval, and Data Mining.
References
Representing the suffix tree with the CDAWG
- Computer ScienceCPM
- 2017
This technique, based on a heavy path decomposition of the suffix tree, also enables a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(\log{n})$ time.
Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression
- Computer ScienceSPIRE
- 2017
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs…
Composite Repetition-Aware Data Structures
- Computer ScienceCPM
- 2015
Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.
Succinct Suffix Arrays based on Run-Length Encoding
- Computer ScienceNord. J. Comput.
- 2005
A new self-index, called RLFM index for "run-length FM-index", that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)), and it is shown that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ.
Fully compressed suffix trees
- Computer ScienceTALG
- 2011
This article introduces the first compressed suffix tree representation that requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time.
Fast Fully-Compressed Suffix Trees
- Computer Science2014 Data Compression Conference
- 2014
This work significantly accelerates the fully-compressed suffix tree representation (FCST), and the resulting FCST variant becomes very attractive in terms of space and time, and a promising alternative in practice.
Storage and Retrieval of Highly Repetitive Sequence Collections
- BiologyJ. Comput. Biol.
- 2010
New static and dynamic full-text indexes are developed that are capable of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations.
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections
- Computer ScienceSPIRE
- 2008
It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task; some new structures that use run-length encoding are engineered, and empirical evidence is given that these structures are superior to the current structures.