Hector Ferrada

Advances in DNA sequencing mean that databases of thousands of human genomes will soon be commonplace. In this paper, we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we pre-process the text with the lossless data compression algorithm…
We introduce a compression technique for suffix arrays. It is sensitive to the compressibility of the text and local, meaning that random portions of the suffix array can be decompressed by accessing mostly contiguous memory areas. This makes decompression very fast, especially when various contiguous cells must be accessed. Our main technical…
Fischer and Heun [SICOMP 2011] proposed the first Range Minimum Query (RMQ) data structure on an array A[1, n] that uses 2n + o(n) bits and answers queries in O(1) time without accessing A. Their scheme converts the Cartesian tree of A into a general tree, which is represented using DFUDS. We show that, by using instead the BP representation, the formula…
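As a rough illustration of why such a structure can answer queries without accessing A, the sketch below builds the Cartesian tree of an array and answers a range minimum query as a lowest common ancestor in that tree. It is not the succinct DFUDS or BP encoding described in the abstract, and it does not run in O(1) time: the function names, the 0-based indexing, and the naive ancestor walk are assumptions made purely for illustration.

```python
def cartesian_tree_parents(a):
    """Return parent[i] for each position in the Cartesian tree of `a`.

    The root (position of the leftmost minimum) has parent -1.
    Built with the standard linear-time stack scan.
    """
    parent = [-1] * len(a)
    stack = []  # positions whose values are non-decreasing from bottom to top
    for i in range(len(a)):
        last = -1
        # Pop strictly larger values; the last one popped becomes i's left child.
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()
        if last != -1:
            parent[last] = i
        if stack:
            parent[i] = stack[-1]  # i becomes the right child of the stack top
        stack.append(i)
    return parent


def rmq_via_lca(parent, i, j):
    """Position of the leftmost minimum in a[i..j], using only the tree shape.

    Naive LCA by walking ancestors (illustration only, not constant time).
    """
    ancestors = set()
    x = i
    while x != -1:
        ancestors.add(x)
        x = parent[x]
    y = j
    while y not in ancestors:
        y = parent[y]
    return y
```

Note that `rmq_via_lca` never reads the array itself: once the tree is built, the array can be discarded, which is the property that the succinct encodings exploit.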
With current hardware and software, a standard computer can now hold in RAM an index for approximate pattern matching on about half a dozen human genomes. Sequencing technologies have improved so quickly, however, that scientists will soon demand indexes for thousands of genomes. Whereas most researchers who have addressed this problem have proposed…
Given a collection of strings (called documents), the top-k document retrieval problem is that of, given a string pattern p, finding the k documents where p appears most often. This is a basic task in most information retrieval scenarios. The best current implementations require 20–30 bits per character (bpc) and k to 4k microseconds per query, or 12–24 bpc…
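To make the problem statement concrete, here is a brute-force baseline that scans every document and keeps the k with the highest occurrence counts. It only defines the expected output; it is not the compressed index the abstract refers to. The function names, the choice to count overlapping occurrences, and the tie-breaking by document position are assumptions.

```python
import heapq


def count_occurrences(text, pattern):
    """Count occurrences of `pattern` in `text`, overlapping ones included."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents, pattern, k):
    """Return up to k (document_index, count) pairs, most frequent first.

    Documents that do not contain the pattern are never reported.
    Ties are broken by the smaller document index (an assumption).
    """
    hits = []
    for index, doc in enumerate(documents):
        count = count_occurrences(doc, pattern)
        if count > 0:
            hits.append((count, index))
    best = heapq.nsmallest(k, hits, key=lambda ci: (-ci[0], ci[1]))
    return [(index, count) for count, index in best]
```

This baseline takes time proportional to the total length of the collection for every query, which is the cost that indexed solutions are designed to avoid.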