Cytosine methylation is a DNA modification with a major impact on the regulation of gene expression and important implications for the biology and health of many organisms, including humans. Bisulfite conversion followed by next-generation sequencing (BS-seq) is the gold-standard technique for detecting DNA methylation at single-base …
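The core signal behind BS-seq calling can be illustrated with a minimal sketch: bisulfite treatment converts unmethylated cytosines to uracil (read as T), while methylated cytosines stay C, so the methylation level at a cytosine site is the fraction of C bases among the C/T reads covering it. The function name and input format below are illustrative, not taken from any specific tool.

```python
def methylation_level(pileup_bases):
    """Estimate the methylation level at a cytosine site from the
    bases of bisulfite-converted reads covering that position.

    After bisulfite conversion, unmethylated C reads as T,
    while methylated C remains C.
    """
    c = pileup_bases.count('C')  # methylated cytosines
    t = pileup_bases.count('T')  # converted (unmethylated) cytosines
    return c / (c + t) if c + t else 0.0
```

For example, a pileup of `"CCTCT"` yields a methylation level of 3/5 = 0.6 at that site.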
Longest Common Extension (LCE) queries are a fundamental subroutine in many string-processing algorithms, including (but not limited to) suffix sorting, string matching, and the identification of palindromic factors and repeats. An LCE query takes as input two positions i, j in a text T ∈ Σ^n and returns the length ℓ of the longest common prefix between T's …
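The query itself is easy to state: a naive scan, shown below, answers it in O(ℓ) time, and the data structures discussed in these abstracts exist precisely to beat this bound.

```python
def lce(T, i, j):
    """Return the length of the longest common prefix of the
    suffixes T[i:] and T[j:] by direct character comparison."""
    l = 0
    while i + l < len(T) and j + l < len(T) and T[i + l] == T[j + l]:
        l += 1
    return l
```

For instance, in T = "banana" the suffixes at positions 1 and 3 ("anana" and "ana") share the prefix "ana", so the query returns 3.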
In this paper, we show that the LZ77 factorization of a text T ∈ Σ^n can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T (reversed). For (extremely) repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the …
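For reference, the object being computed can be defined by a quadratic-time sketch (not the paper's run-length-BWT algorithm): each LZ77 phrase is the longest prefix of the remaining suffix that occurs earlier in the text, followed by one fresh character. The triple format below is one common convention.

```python
def lz77(T):
    """Naive LZ77 factorization: each phrase is the longest prefix of
    T[i:] occurring at some earlier position, plus one new character.
    Returns a list of (source, length, next_char) triples; source is -1
    for phrases consisting of a single fresh character."""
    phrases, i, n = [], 0, len(T)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # try every earlier starting position
            l = 0
            while i + l < n and T[j + l] == T[i + l]:
                l += 1  # matches may self-overlap, as in classic LZ77
            if l > best_len:
                best_len, best_src = l, j
        nxt = T[i + best_len] if i + best_len < n else ''
        phrases.append((best_src, best_len, nxt))
        i += best_len + 1
    return phrases
```

On a repetitive input the number of phrases is small: "abab" factors into just three phrases, the last one covering the entire repeated half.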
In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide …
In this paper we address the longest common extension (LCE) problem: to compute the length ℓ of the longest common prefix between any two suffixes of T ∈ Σ^n with Σ = {0, …, σ − 1}. We present two fast and space-efficient solutions based on (Karp-Rabin) fingerprinting and sampling. Our first data structure exploits properties of Mersenne prime numbers when …
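The fingerprinting idea can be sketched as follows, under assumptions of my own choosing (base and the Mersenne modulus 2^61 − 1 are illustrative; this is a generic Monte Carlo scheme, not the paper's exact data structure): precompute prefix fingerprints, obtain the fingerprint of any substring in O(1), and binary-search for the largest length at which the two extensions still agree.

```python
P = (1 << 61) - 1  # Mersenne prime 2^61 - 1, used as the modulus
B = 1_000_003      # illustrative base; in practice chosen at random

def build(T):
    """Precompute prefix fingerprints h and base powers pw for T."""
    n = len(T)
    h = [0] * (n + 1)   # h[k] = fingerprint of T[:k]
    pw = [1] * (n + 1)  # pw[k] = B^k mod P
    for k, c in enumerate(T):
        h[k + 1] = (h[k] * B + ord(c)) % P
        pw[k + 1] = pw[k] * B % P
    return h, pw

def fp(h, pw, i, l):
    """O(1) fingerprint of the substring T[i:i+l]."""
    return (h[i + l] - h[i] * pw[l]) % P

def lce_fp(T, h, pw, i, j):
    """LCE of suffixes i and j via binary search on fingerprint
    equality; correct with high probability (Monte Carlo)."""
    lo, hi = 0, len(T) - max(i, j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fp(h, pw, i, mid) == fp(h, pw, j, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

After O(n)-time preprocessing, each query costs O(log n) fingerprint comparisons; Mersenne moduli are attractive here because reduction modulo 2^61 − 1 can be done with shifts and additions instead of division.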
The high throughput of modern NGS sequencers, coupled with the huge sizes of the genomes currently analysed, poses ever-greater algorithmic challenges for aligning short reads quickly and accurately against a reference sequence. A crucial additional requirement is that the data structures used should be light. The available modern solutions usually are a …
We consider the problem of indexing a text T (of length n) with a light data structure that supports efficient search of patterns P (of length m) allowing errors under the Hamming distance. We propose a hash-based strategy that employs two classes of hash functions—dubbed Hamming-aware and de Bruijn—to drastically reduce search space and memory footprint of …
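To make the search-space-reduction idea concrete, here is a generic filter-and-verify sketch based on the pigeonhole principle, not the Hamming-aware/de Bruijn hash construction of the abstract itself: split P into k+1 blocks, so any occurrence with at most k mismatches must contain at least one block exactly; exact block hits generate candidates, which are then verified.

```python
def hamming_search(T, P, k):
    """Return positions where P occurs in T with at most k mismatches.

    Filter: split P into k+1 blocks; by the pigeonhole principle an
    occurrence with <= k mismatches matches some block exactly.
    Verify: check the full Hamming distance at each candidate."""
    m = len(P)
    blk = m // (k + 1)
    cands = set()
    for b in range(k + 1):
        # last block absorbs the remainder of the division
        seed = P[b * blk:(b + 1) * blk] if b < k else P[k * blk:]
        start = b * blk
        pos = T.find(seed)
        while pos != -1:
            if 0 <= pos - start <= len(T) - m:
                cands.add(pos - start)  # candidate occurrence start
            pos = T.find(seed, pos + 1)
    return sorted(p for p in cands
                  if sum(a != b for a, b in zip(T[p:p + m], P)) <= k)
```

For example, searching "abXd" in "abcdefg" with k = 1 reports position 0, found via the exact hit of the block "ab". An index over the blocks (hash table or succinct structure) replaces the linear `find` scans in a real implementation.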
Re-Pair [5] is an effective grammar-based compression scheme achieving strong compression rates in practice. Let n, σ, and d be the text length, alphabet size, and dictionary size of the final grammar, respectively. In their original paper, the authors show how to compute the Re-Pair grammar in expected linear time and 5n + 4σ² + 4d + …
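The scheme itself is simple to state, and a minimal (unoptimized, quadratic-time) sketch conveys it: repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal, recording a grammar rule for each replacement, until no pair occurs twice. The linear-time algorithm of the original paper achieves the same result with priority queues and linked sequences.

```python
from collections import Counter

def repair(seq):
    """Minimal Re-Pair: replace the most frequent adjacent pair with a
    new nonterminal until every pair occurs at most once.
    Returns the final sequence and the dictionary of grammar rules."""
    seq = list(seq)
    rules = {}      # nonterminal -> (left symbol, right symbol)
    next_sym = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        nt = ('N', next_sym)  # fresh nonterminal
        next_sym += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):   # left-to-right, non-overlapping rewrite
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules
```

On "abab" the pair ('a', 'b') occurs twice, so one rule N0 → ab is created and the sequence compresses to N0 N0. The d in the space bound above counts exactly these rules.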