# Fast Label Extraction in the CDAWG

@article{Belazzougui2017FastLE,
title={Fast Label Extraction in the CDAWG},
author={Djamal Belazzougui and Fabio Cunial},
journal={ArXiv},
year={2017},
volume={abs/1707.08197}
}
• Published 25 July 2017
• Computer Science
• ArXiv
The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data…
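To make the quantity $e$ concrete, the following brute-force sketch enumerates the maximal repeats of a short string and counts their right extensions. It is not the paper's construction: the function names, the appended `$` sentinel, the treatment of the start of the text as a distinct left context, and the inclusion of the empty string as a repeat are all assumptions made for illustration. It examines a quadratic number of substrings, so it only suits toy inputs.

```python
from collections import defaultdict


def maximal_repeat_extensions(text, sentinel="$"):
    """Map each maximal repeat of text+sentinel to its set of right extensions.

    Hypothetical helper for illustration only (brute force over all substrings).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    t = text + sentinel
    n = len(t)
    count = defaultdict(int)   # occurrences of each substring
    left = defaultdict(set)    # characters seen immediately before it
    right = defaultdict(set)   # characters seen immediately after it
    for i in range(n):
        for j in range(i, n + 1):          # t[i:j]; j == i gives the empty string
            w = t[i:j]
            count[w] += 1
            # None stands for "start of text", a left context unlike any character.
            left[w].add(t[i - 1] if i > 0 else None)
            if j < n:
                right[w].add(t[j])
    # Maximal repeat: occurs at least twice, with at least two distinct
    # left contexts and at least two distinct right contexts.
    return {
        w: right[w]
        for w in count
        if count[w] >= 2 and len(left[w]) >= 2 and len(right[w]) >= 2
    }


def count_right_extensions(text):
    """Total number of right extensions of maximal repeats (the quantity e)."""
    return sum(len(exts) for exts in maximal_repeat_extensions(text).values())


if __name__ == "__main__":
    for s in ["abab", "aaaa"]:
        print(s, count_right_extensions(s))
```

Under these conventions, a string made of many near-identical copies of one sequence adds few new maximal repeats per copy, which is the intuition behind $e$ growing more slowly than $n$ on repetitive collections.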
• Computer Science
CPM
• 2019
Two types of online algorithms are presented, which "directly" construct the linear-size suffix trie (LST), from right to left and from left to right, without constructing the suffix tree as an intermediate structure.
• Computer Science
ArXiv
• 2019
An index that supports bidirectional addition and removal in $O(\log{\log{|T|}})$ time, and that occupies a number of words proportional to the number of left and right extensions of the maximal repeats of $T$.
• Computer Science
SODA
• 2018
This paper shows how to extend the Run-Length FM-index so that it can locate the occurrences of a pattern efficiently within $O(r)$ space (in loglogarithmic time each), and how to reach optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits.
• Computer Science
J. ACM
• 2020
This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space, and outperforms the space-competitive alternatives by 1-2 orders of magnitude in time.
• Computer Science, Mathematics
LATIN
• 2020
A smaller measure, $\delta$, is studied, which can be computed in linear time and better captures the concept of compressibility in repetitive strings, and it is proved that, for some string families, it holds $\gamma = \Omega(\delta \log n)$.
A full perspective on the sizes of indexing structures such as suffix trees, DAWGs, and CDAWGs for forward and backward tries is shown.
• Computer Science
IEEE Transactions on Information Theory
• 2022
This paper argues that δ better captures the compressibility of repetitive strings, and studies an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space.
• Computer Science
SPIRE
• 2018
This paper proposes an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, which consists of the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS-induced distances.
• Computer Science
CPM
• 2019
An index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T.
Being able to manipulate text within compressed space, with compression related to its repetitiveness, is critically important in many areas of study, such as Bioinformatics, Information Retrieval, and Data Mining, among others.

## References
• Computer Science
CPM
• 2017
This technique, based on a heavy path decomposition of the suffix tree, also enables a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(\log{n})$ time.
• Computer Science
SPIRE
• 2017
In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs
• Computer Science
CPM
• 2015
Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.
• Computer Science
Nord. J. Comput.
• 2005
A new self-index, called RLFM index for "run-length FM-index", that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)), and it is shown that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ.
• Computer Science
TALG
• 2011
This article introduces the first compressed suffix tree representation that requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time.
• Computer Science
Theor. Comput. Sci.
• 2016
• Computer Science
2014 Data Compression Conference
• 2014
This work significantly accelerates the fully-compressed suffix tree representation (FCST), and the resulting FCST variant becomes very attractive in terms of space and time, and a promising alternative in practice.
• Computer Science
J. Comput. Syst. Sci.
• 1994
• Biology
J. Comput. Biol.
• 2010
New static and dynamic full-text indexes are developed that are capable of capturing the fact that a collection is highly repetitive, and that require space basically proportional to the length of one typical sequence plus the total number of edit operations.
• Computer Science
SPIRE
• 2008
It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task; some new structures that use run-length encoding are engineered, and empirical evidence is given that these structures are superior to the current ones.