In this paper we address the problem of identifying differences between populations of trees. Besides the theoretical relevance of this problem, we are interested in testing whether trees characterizing protein sequences from different families constitute samples from significantly different distributions. In this context, trees are obtained by modelling protein …
The completion of the genome sequence of Plasmodium falciparum revealed that close to 60% of the annotated genome corresponds to hypothetical proteins and that many genes, whose metabolic pathways or biological products are known, have not been predicted from sequence similarity searches. Recently, using global gene expression of the asexual blood stages of …
MOTIVATION: A central problem in genomics is to determine the function of a protein using the information contained in its amino acid sequence. Variable length Markov chains (VLMC) are a promising class of models that can effectively classify proteins into families, and they can be estimated in linear time and space. RESULTS: We introduce a new algorithm, …
We introduce a new criterion to select in a consistent way the probabilistic context tree generating a sample. The basic idea is to construct a totally ordered set of candidate trees. This set is composed of the "champion trees", the ones that maximize the likelihood of the sample for each number of degrees of freedom. The smallest maximizer criterion …
The completion of the genome sequence of Plasmodium falciparum revealed that close to 60% of the annotated genome corresponds to hypothetical proteins and that many genes, whose metabolic pathways or biological products are known biochemically, had not been predicted. Recently, using global gene expression of the asexual blood stages of P. falciparum at 1 h …
Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test whether two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as …
We find upper bounds for the probability of underestimation and overestimation errors in penalized likelihood context tree estimation. The bounds are explicit and apply to processes of not necessarily finite memory. We allow for general penalizing terms and we give conditions on the maximal depth of the estimated trees in order to get strongly …
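As a rough illustration of what a penalized likelihood score for a context tree looks like, the sketch below evaluates one candidate tree on a sample. It is not the paper's estimator: the function name `penalized_log_likelihood`, the representation of a tree as a list of context strings, and the flat per-context penalty are all assumptions made for this example. An estimator would maximize such a score over candidate trees up to some maximal depth.

```python
import math
from collections import Counter


def penalized_log_likelihood(sample, contexts, penalty_per_context):
    """Score one candidate context tree on a sample (hypothetical helper).

    sample: string of symbols.
    contexts: list of strings; each sufficiently long past must end with
        exactly one of them (the empty string is the memoryless tree).
    penalty_per_context: assumed flat penalty charged per context.
    """
    max_depth = max(len(w) for w in contexts)
    counts = Counter()  # (context, next symbol) -> number of occurrences
    for i in range(max_depth, len(sample)):
        past = sample[:i]
        matches = [w for w in contexts if past.endswith(w)]
        if len(matches) != 1:
            raise ValueError("contexts do not form a valid tree for this past")
        counts[(matches[0], sample[i])] += 1

    totals = Counter()  # context -> number of occurrences
    for (w, _), n in counts.items():
        totals[w] += n

    # Maximized log-likelihood: sum of N(w, a) * log(N(w, a) / N(w)).
    log_lik = sum(n * math.log(n / totals[w]) for (w, _), n in counts.items())
    return log_lik - penalty_per_context * len(contexts)
```

For the alternating sample `"0101010101"`, the tree with contexts `"0"` and `"1"` predicts every symbol perfectly, so its score is just the penalty term, while the memoryless tree pays a likelihood cost instead.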
We consider binary infinite order stochastic chains perturbed by random noise. This means that at each time step, the value assumed by the chain can be randomly and independently flipped with a small fixed probability. We show that the transition probabilities of the perturbed chain are uniformly close to the corresponding transition probabilities of the …
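The noise model described above can be sketched in a few lines. This is only an illustration of the flipping mechanism, not code from the paper; the function name `perturb` and its parameters are assumptions made for this example, and the input chain is taken to be a finite list of 0s and 1s.

```python
import random


def perturb(chain, flip_prob, rng=None):
    """Flip each binary value independently with probability flip_prob.

    chain: list of 0/1 values (a finite stretch of the original chain).
    flip_prob: the small fixed flipping probability.
    rng: optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    # rng.random() is uniform on [0, 1), so a symbol is flipped with
    # probability flip_prob, independently of all other time steps.
    return [1 - x if rng.random() < flip_prob else x for x in chain]
```

With `flip_prob=0` the chain is returned unchanged, and with `flip_prob=1` every symbol is flipped; the case of interest in the abstract is a small value in between.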
The goal of this paper is to study the similarity between sequences using a distance between the context trees associated with the sequences. These trees are defined in the framework of Sparse Probabilistic Suffix Trees (SPST), and can be estimated using the SPST algorithm. We implement the Phyl-SPST package to compute the distance between the sparse context …