Süleyman Cenk Sahinalp

Recent studies demonstrating the existence of special noncoding "antisense" RNAs used in post-transcriptional gene regulation have received considerable attention. These RNAs are synthesized naturally to control gene expression in C. elegans, Drosophila, and other organisms; they are known to regulate plasmid copy numbers in E. coli as well. Small RNAs have …
Recent studies show that along with single nucleotide polymorphisms and small indels, larger structural variants among human individuals are common. The Human Genome Structural Variation Project aims to identify and classify deletions, insertions, and inversions (>5 Kbp) in a small number of normal individuals with a fosmid-based paired-end sequencing …
Motivation: High-throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for the computational infrastructure. Data management, storage and analysis have become major logistical obstacles for those adopting the new platforms. The requirement for large investment for this purpose almost signalled the end of …
If the genetic maps of two species are modelled as permutations of (homologous) genes, the number of chromosomal rearrangements (deletions, block moves, inversions, etc.) needed to transform one such permutation into another can be used as a measure of their evolutionary distance. Motivated by such scenarios, we study problems of computing distances …
We address the problem of minimizing the communication involved in the exchange of similar documents. We consider two users, A and B, who hold documents x and y respectively. Neither of the users has any information about the other's document. They exchange messages so that B computes x; it may be required that A compute y as well. Our goal is to design …
We study sequence nearest neighbors (SNN). Let D be a database of n sequences; we would like to preprocess D so that given any on-line query sequence Q we can quickly find a sequence S in D for which d(S, Q) ≤ d(S, T) for any other sequence T in D. Here d(S, Q) denotes the distance between sequences S and Q, defined to be the minimum number of edit operations …
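As a point of reference for the nearest-neighbor query described above, the following is a minimal brute-force sketch. It is not the preprocessing scheme the abstract studies: it scans the whole database on every query. Because the abstract is cut off before it names the edit operations, the sketch assumes plain character-level edit distance (insert, delete, substitute); the function names `edit_distance` and `nearest_neighbor` are hypothetical.

```python
def edit_distance(s, t):
    """Character-level edit distance (assumed metric: insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def nearest_neighbor(database, query):
    """Linear scan: return a sequence in the database closest to the query."""
    # Ties are broken in favor of the earliest sequence in the database.
    return min(database, key=lambda s: edit_distance(s, query))
```

Each query here costs one distance computation per database sequence, which is the cost that preprocessing the database is meant to avoid.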
In many database applications involving string data, it is common to have near neighbor queries (asking for strings that are similar to a query string) or nearest neighbor queries (asking for strings that are most similar to a query string). The similarity between strings is defined in terms of a distance function determined by the application domain. The …
The problem of fast address lookup is crucial to routing and thus has received considerable attention. Most of the work in this field has focused on improving the speed of individual accesses, independent of the underlying access pattern. Recently, Gupta et al. [7] proposed an efficient data structure to exploit the bias in the access pattern. This …
Computational genomics involves comparing sequences based on "similarity" for detecting evolutionary and functional relationships. Until very recently, available portions of the human genome sequence (and that of other species) were fairly short and sparse. Most sequencing effort was focused on genes and other short units; similarity between such …
We consider the parsing method to be used in dynamic dictionary based data compression. We show that (1) the commonly used greedy parsing may result in far from optimal compression with respect to the dictionary in use; (2) a one-lookahead greedy parsing scheme obtains optimality with respect to any dictionary construction scheme that satisfies the prefix …
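Claim (1) above can be illustrated with a small sketch. For simplicity it assumes a fixed (static) dictionary of non-empty phrases rather than the dynamic dictionaries the abstract addresses, and it measures compression by the number of phrases emitted. The function names and the example dictionary are hypothetical; the one-lookahead scheme of claim (2) is not implemented here.

```python
def greedy_parse(text, dictionary):
    """Greedy parsing: at each position take the longest matching phrase."""
    phrases, i = [], 0
    max_len = max(map(len, dictionary))
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                phrases.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no dictionary phrase matches at position {i}")
    return phrases


def optimal_parse(text, dictionary):
    """Fewest-phrase parsing via dynamic programming over suffixes."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] = a shortest parse of text[i:], if any
    best[n] = []
    for i in range(n - 1, -1, -1):
        for word in dictionary:
            rest = best[i + len(word)] if text.startswith(word, i) else None
            if rest is not None and (best[i] is None or len(rest) + 1 < len(best[i])):
                best[i] = [word] + rest
    if best[0] is None:
        raise ValueError("text cannot be parsed with this dictionary")
    return best[0]


# Hypothetical example: greedy commits to "ab" and then needs two more phrases.
dictionary = {"a", "b", "ab", "bbb"}
print(greedy_parse("abbb", dictionary))   # ['ab', 'b', 'b']
print(optimal_parse("abbb", dictionary))  # ['a', 'bbb']
```

On this input greedy parsing emits three phrases where two suffice, because taking the longest match first rules out the longer phrase "bbb" that follows.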