The number of completely sequenced bacterial genomes has been growing fast. There are computer methods available for finding genes but yet there is a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark…
Helicobacter pylori, strain 26695, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences. Sequence analysis indicates that H. pylori has well-developed systems for motility, for scavenging iron, and for DNA restriction and modification. Many putative adhesins, lipoproteins and other outer membrane proteins were identified,…
Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by…
The problem of predicting gene locations in newly sequenced DNA is well known but still far from being successfully resolved. A novel approach to the problem based on the frame dependent (non-homogeneous) Markov chain models of protein-coding regions was previously suggested. This approach is, apparently, one of the most powerful "search by content"…
The woodland strawberry, Fragaria vesca (2n = 2x = 14), is a versatile experimental plant system. This diminutive herbaceous perennial has a small genome (240 Mb), is amenable to genetic transformation and shares substantial sequence identity with the cultivated strawberry (Fragaria × ananassa) and other economically important rosaceous plants. Here we…
cagA, a gene that codes for an immunodominant antigen, is present only in Helicobacter pylori strains that are associated with severe forms of gastroduodenal disease (type I strains). We found that the genetic locus that contains cagA (cag) is part of a 40-kb DNA insertion that likely was acquired horizontally and integrated into the chromosomal glutamate…
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective…
Computer methods of accurate gene finding in DNA sequences require models of protein coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. Here we propose a new, heuristic method producing fairly accurate inhomogeneous Markov models of protein coding regions. The new method…
Chlorella variabilis NC64A, a unicellular photosynthetic green alga (Trebouxiophyceae), is an intracellular photobiont of Paramecium bursaria and a model system for studying virus/algal interactions. We sequenced its 46-Mb nuclear genome, revealing an expansion of protein families that could have participated in adaptation to symbiosis. NC64A exhibits…
We describe a new ab initio algorithm, GeneMark-ES version 2, that identifies protein-coding genes in fungal genomes. The algorithm does not require a predetermined training set to estimate parameters of the underlying hidden Markov model (HMM). Instead, the anonymous genomic sequence in question is used as an input for iterative unsupervised training. The…