Fast Label Extraction in the CDAWG

Djamal Belazzougui and Fabio Cunial
The compact directed acyclic word graph (CDAWG) of a string $T$ of length $n$ takes space proportional just to the number $e$ of right extensions of the maximal repeats of $T$, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which $e$ grows significantly more slowly than $n$. We reduce from $O(m\log{\log{n}})$ to $O(m)$ the time needed to count the number of occurrences of a pattern of length $m$, using an existing data… 

Online Algorithms for Constructing Linear-size Suffix Trie

Two types of online algorithms are presented that construct the linear-size suffix trie (LST) directly, one from right to left and one from left to right, without building the suffix tree as an intermediate structure.

Fully-functional bidirectional Burrows-Wheeler indexes

An index that supports bidirectional addition and removal in $O(\log{\log{|T|}})$ time, and that occupies a number of words proportional to the number of left and right extensions of the maximal repeats of $T$.

Optimal-Time Text Indexing in BWT-runs Bounded Space

This paper shows how to extend the Run-Length FM-index so that it can locate the occurrences of a pattern efficiently within $O(r)$ space (in log-logarithmic time each), reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits.

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

This article shows how to extend the Run-Length FM-index so that it can locate the occ occurrences of a pattern efficiently (in O(occ log log n) time) within O(r) space, and reports that the resulting index outperforms the space-competitive alternatives by 1--2 orders of magnitude in time.

Towards a Definitive Measure of Repetitiveness

A smaller measure, $\delta$, is studied, which can be computed in linear time and captures better the concept of compressibility in repetitive strings, and it is proved that, for some string families, it holds $\gamma = \Omega(\delta \log n)$.

Suffix Trees, DAWGs and CDAWGs for Forward and Backward Tries

A full perspective on the sizes of indexing structures such as suffix trees, DAWGs, and CDAWGs for forward and backward tries is shown.

Towards a Definitive Compressibility Measure for Repetitive Sequences

This paper argues that δ better captures the compressibility of repetitive strings, and studies an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotone, and allows encoding every string in O(δ log(n/δ)) space.

The colored longest common prefix array computed via sequential scans

This paper proposes an efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances.

Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs

An index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T.

Block Tree based Universal Self-Index for Repetitive Text Collections

Being able to manipulate the text within compressed space, with compression related to its repetitiveness, is critically important in many areas of study, such as bioinformatics, information retrieval, and data mining.



Representing the suffix tree with the CDAWG

This technique, based on a heavy path decomposition of the suffix tree, also enables a representation of the prefix array, of the inverse suffix array, and of $T$ itself that takes $O(e_T)$ words of space and supports random access in $O(\log{n})$ time.

Linear-Size CDAWG: New Repetition-Aware Indexing and Grammar Compression

In this paper, we propose a novel approach to combine compact directed acyclic word graphs (CDAWGs) and grammar-based compression. This leads us to an efficient self-index, called Linear-size CDAWGs

Composite Repetition-Aware Data Structures

Two data structures are described whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure.

Succinct Suffix Arrays based on Run-Length Encoding

A new self-index, called the RLFM index (for "run-length FM-index"), counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)); it is also shown that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ.

Fully compressed suffix trees

This article introduces the first compressed suffix tree representation that requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time.

Linear-size suffix tries

Fast Fully-Compressed Suffix Trees

This work significantly accelerates the fully-compressed suffix tree representation (FCST), and the resulting FCST variant becomes very attractive in terms of space and time, and a promising alternative in practice.

Finding Level-Ancestors in Trees

Storage and Retrieval of Highly Repetitive Sequence Collections

New static and dynamic full-text indexes are developed that are able to capture the fact that a collection is highly repetitive, and that require space essentially proportional to the length of one typical sequence plus the total number of edit operations.

Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

It is shown that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task; some new structures that use run-length encoding are engineered, and empirical evidence is given that these structures are superior to the current ones.