Efficiently Mining Sequential Generator Patterns Using Prefix Trees
- Thi-Thiet Pham
- Fundam. Inform.
Proteins are the structural components of living cells and tissues and are therefore an important building block of all living organisms. Sequence motifs are subsequences that appear frequently in proteins. Motifs often mark important functional regions and can be used to characterize a protein family or to infer the function of a protein. The SP-index algorithm was proposed to find sequence motifs containing gaps of arbitrary size. To find motifs, it constructs B-trees that index the positions at which short segments occur. To check whether a long pattern composed of short segments appears frequently, however, the SP-index algorithm must test a large number of nodes in those B-trees, which can be inefficient. In this paper, we therefore propose the BitPattern-based (BP) algorithm to improve on the efficiency of the SP-index algorithm. First, the BP algorithm transforms the protein sequences into bit patterns. Then, instead of testing a large number of B-tree nodes as the SP-index algorithm does, it uses bit operations (AND, OR, shifting, and masking) to find sequence motifs efficiently. The BP algorithm also performs a pruning step to reduce processing time. Experimental results on biological and synthetic data sets show that the BP algorithm requires less processing time than the SP-index algorithm.
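
The abstract does not give the exact bit encoding, so the following is only a minimal illustrative sketch of the general idea, not the BP algorithm itself. It assumes one bitmask per residue symbol (bit i is set when the symbol occurs at position i), non-empty segments, and gaps of any length including zero. All function names (`symbol_masks`, `segment_starts`, `contains_pattern`, `support`) are hypothetical, and the pruning step is not modeled.

```python
def symbol_masks(seq):
    """Map each symbol to an int whose bit i is set when seq[i] is that symbol."""
    masks = {}
    for i, ch in enumerate(seq):
        masks[ch] = masks.get(ch, 0) | (1 << i)  # OR in the position bit
    return masks


def segment_starts(masks, segment):
    """Return an int whose bit i is set when the contiguous segment starts at i.

    Shifting the mask of the k-th symbol right by k aligns it with the
    start position, so ANDing all shifted masks keeps only full matches.
    """
    starts = masks.get(segment[0], 0)
    for k, ch in enumerate(segment[1:], start=1):
        starts &= masks.get(ch, 0) >> k
    return starts


def contains_pattern(seq, segments):
    """Check whether the segments occur in order, separated by arbitrary gaps.

    Assumption of this sketch: a gap may have any length, including zero.
    Taking the earliest feasible start for each segment is sufficient.
    """
    masks = symbol_masks(seq)
    min_start = 0
    for seg in segments:
        starts = segment_starts(masks, seg)
        starts &= ~((1 << min_start) - 1)  # mask out positions before min_start
        if starts == 0:
            return False
        first = (starts & -starts).bit_length() - 1  # index of lowest set bit
        min_start = first + len(seg)
    return True


def support(sequences, segments):
    """Count the sequences that contain the gapped pattern."""
    return sum(contains_pattern(s, segments) for s in sequences)
```

For example, with the toy sequences `["ACDEFGH", "ACXXFG", "FGAC"]` and the segments `["AC", "FG"]`, `support` counts the first two sequences but not the third, where `FG` occurs only before `AC`. A pattern would be reported as a motif when its support reaches a user-chosen threshold; that threshold is likewise an assumption of this sketch.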