Finding the region of pseudo-periodic tandem repeats in biological sequences
Algorithm development for finding quasiperiodic regions in sequences is at the core of many problems arising in biological sequence analysis. We solve an important problem in this area. Let $A$ be an alphabet of size $n$ and $A^l$ denote the set of sequences of length $l$ over $A$. Given a sequence $S = s_1 s_2 \cdots s_l \in A^l$, a positive integer $p$ is called a period of $S$ if $s_i = s_{i+p}$ for $1 \le i \le l - p$. $S$ is called $p$-periodic if its minimum period is $p$. Let $\Pi_l(p)$ denote the set of $p$-periodic sequences in $A^l$. A natural measure of "nearness to $p$-periodicity" for $S$ is the average Hamming distance to the nearest $p$-periodic sequence: $D(S) = \min_{T \in \Pi_l(p)} D(S, T)$. If $T \in \Pi_l(p)$ is a sequence such that $D(S, T) = D(S)$, then $T$ is called a nearest $p$-periodic sequence of $S$, and $S$ is called $p$-quasiperiodic associated with the score $D(S)$. This paper develops an efficient algorithm for finding a nearest $p$-periodic sequence of $S$ by means of its modulo-$p$ incidence matrix. Let $\alpha = (\alpha_1, \ldots, \alpha_n)$ and $\beta = (\underbrace{q+1, \ldots, q+1}_{r}, \underbrace{q, \ldots, q}_{p-r})$, where $l = \alpha_1 + \alpha_2 + \cdots + \alpha_n$ is a partition of $l$, $q$ is the quotient, and $r$ is the remainder when $l$ is divided by $p$. This paper shows that there exists a sequence in $A^l$ whose modulo-$p$ incidence matrix has row sum vector $\alpha$ and column sum vector $\beta$.
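The following is a minimal illustrative sketch in Python, not the algorithm developed in the paper. It assumes that the modulo-p incidence matrix counts, for each residue class of positions modulo p, how often each symbol occurs, and it picks the most frequent symbol in each class. This yields a nearest sequence among those having p as a period; it does not enforce that p is the minimum period, which the definition of p-periodicity requires. The function names `incidence_matrix` and `nearest_with_period` are hypothetical.

```python
from collections import Counter


def incidence_matrix(s, p):
    """Assumed definition: column j (0-based) counts the symbols found at
    positions i with i % p == j. Returned as a list of p Counters."""
    if not 1 <= p <= len(s):
        raise ValueError("p must satisfy 1 <= p <= len(s)")
    cols = [Counter() for _ in range(p)]
    for i, ch in enumerate(s):
        cols[i % p][ch] += 1
    return cols


def nearest_with_period(s, p):
    """Return (t, d): a sequence t that has p as a period (not necessarily
    its minimum period) at minimum Hamming distance from s, and the
    average Hamming distance d = mismatches / len(s)."""
    cols = incidence_matrix(s, p)
    # Most frequent symbol per residue class; ties go to the smaller symbol
    # so that the result is deterministic.
    unit = [min(c, key=lambda a: (-c[a], a)) for c in cols]
    t = "".join(unit[i % p] for i in range(len(s)))
    mismatches = sum(a != b for a, b in zip(s, t))
    return t, mismatches / len(s)


def column_sums(s, p):
    """Column sums of the assumed incidence matrix: r entries equal to q + 1
    followed by p - r entries equal to q, where len(s) = q * p + r."""
    return [sum(c.values()) for c in incidence_matrix(s, p)]
```

For example, `nearest_with_period("ACGTACGAACGT", 4)` returns the sequence `ACGTACGTACGT` with one mismatch out of twelve positions, and `column_sums("ACGTACGTAC", 4)` gives `[3, 3, 2, 2]`, matching q = 2 and r = 2 for l = 10 and p = 4.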