Cytosine methylation is a DNA modification that has great impact on the regulation of gene expression and important implications for the biology and health of several living beings, including humans. Bisulfite conversion followed by next generation sequencing (BS-seq) of DNA is the gold standard technique used to detect DNA methylation at single-base…

In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide…

- Alberto Policriti, Nicola Prezza
- 2016 Data Compression Conference (DCC)
- 2016

In this paper, we show that the LZ77 factorization of a text T ∈ Σ^n can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T (reversed). For (extremely) repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the…
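
To make the object being computed concrete, here is a minimal sketch of a greedy LZ77 factorization. It is a naive, slow illustration and not the run-compressed-space algorithm of the paper. It assumes one common variant of the definition: each phrase is the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed), or a single new character. The function name `lz77_factorize` is hypothetical.

```python
def lz77_factorize(text: str) -> list[str]:
    """Greedy LZ77 parse (naive variant, for illustration only)."""
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best = 0
        # Try every earlier starting position j as the source of a copy.
        for j in range(i):
            length = 0
            # The copy may overlap position i; j + length < i + length < n.
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        # If nothing earlier matches, emit a single new character.
        step = max(best, 1)
        phrases.append(text[i:i + step])
        i += step
    return phrases


print(lz77_factorize("abababab"))
```

On a highly repetitive input such as `"abababab"`, the parse has only three phrases, which is the kind of behaviour that makes the number of phrases a useful measure of repetitiveness.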

- Alberto Policriti, Nicola Gigante, Nicola Prezza
- LATA
- 2015

Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive…
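
As an illustration of the measure r, the following sketch builds the BWT by sorting all rotations and counts its maximal runs of equal characters. This is a toy construction under stated assumptions: a terminator `$` that does not occur in the text and sorts before every other character, and hypothetical helper names `bwt` and `count_runs`. Practical indexes do not materialize the rotations.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (toy construction)."""
    s = text + terminator
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters in s."""
    return sum(1 for k in range(len(s)) if k == 0 or s[k] != s[k - 1])


transformed = bwt("banana")
print(transformed, count_runs(transformed))
```

For `"banana"`, the transform is `annb$aa`, which has r = 5 runs over 7 characters; on long repetitive collections r is typically far smaller than the text length.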

- Alberto Policriti, Nicola Prezza
- BMC Bioinformatics
- 2015

The high throughput of modern NGS sequencers, coupled with the huge sizes of the genomes currently analysed, poses ever greater algorithmic challenges in aligning short reads quickly and accurately against a reference sequence. A crucial additional requirement is that the data structures used be light. The available modern solutions usually are a…

- Alberto Policriti, Nicola Prezza
- SPIRE
- 2015

- Alberto Policriti, Nicola Prezza
- ISAAC
- 2014

We consider the problem of indexing a text T (of length n) with a light data structure that supports efficient search of patterns P (of length m) allowing errors under the Hamming distance. We propose a hash-based strategy that employs two classes of hash functions, dubbed Hamming-aware and de Bruijn, to drastically reduce search space and memory footprint of…

- Nicola Prezza, Francesco Vezzi, Max Käller, Alberto Policriti
- BMC Bioinformatics
- 2016

Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of…
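
The loss of complexity can be seen in a small in-silico sketch of bisulfite conversion. It makes simplifying assumptions that are not taken from the paper: a single strand, complete conversion, and a hypothetical function `bisulfite_convert` that takes the 0-based positions of methylated cytosines. Unmethylated C is read as T after sequencing, while methylated C is preserved.

```python
def bisulfite_convert(seq: str, methylated: frozenset = frozenset()) -> str:
    """Simulate complete bisulfite conversion on one strand.

    Unmethylated 'C' becomes 'T'; positions listed in `methylated`
    (0-based, an assumption of this sketch) keep their 'C'.
    """
    return "".join(
        "T" if base == "C" and pos not in methylated else base
        for pos, base in enumerate(seq)
    )


read = "ACGTCC"
print(bisulfite_convert(read, frozenset({4})))  # only position 4 is methylated
print(sorted(set(bisulfite_convert(read))))     # fully unmethylated: no 'C' left
```

With no methylation, the converted read uses only the letters A, G and T, so many distinct genomic sequences collapse onto the same converted string, which is what makes alignment harder.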

- Nicola Prezza
- ArXiv
- 2016

Longest Common Extension (LCE) queries are a fundamental sub-routine in many string-processing algorithms, including (but not limited to) suffix-sorting, string matching, and identification of palindrome factors and repeats. An LCE query takes as input two positions i, j in a text T ∈ Σ^n and returns the length l of the longest common prefix between T's i-th…
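
A minimal sketch of what an LCE query returns, using direct character comparison. This is the naive baseline, not the data structure proposed in the paper, and it assumes 0-based positions and that the query compares the suffixes starting at i and j; the function name `lce` is hypothetical.

```python
def lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    length = 0
    # Extend the match one character at a time until a mismatch or the end.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


print(lce("banana", 1, 3))  # suffixes "anana" and "ana"
```

Each query here costs time proportional to the answer and needs no extra space beyond the text; the interest of dedicated LCE structures lies in answering faster without storing much more than the text itself.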