Roger A. Sayle

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of ...
Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological ...
We present a system employing large grammars and dictionaries to recognize a broad range of chemical entities. The system utilizes these resources to identify chemical entities without an explicit tokenization step. To allow recognition of terms slightly outside the coverage of these resources, we employ spelling correction, entity extension, and merging of ...
An analysis was performed on 335 pairs of structurally aligned proteins derived from the Structural Classification of Proteins (SCOP, http://scop.mrc-lmb.cam.ac.uk/scop/) database. These similarities were divided into analogues, defined as proteins with similar three-dimensional structures (same SCOP fold classification) but generally with different ...
Awareness of the adverse effects of chemicals is important in biomedical research and healthcare. Text mining can allow timely and low-cost extraction of this knowledge from the biomedical literature. We extended our text mining solution, LeadMine, to identify diseases and chemical-induced disease relationships (CIDs). LeadMine is a dictionary/grammar-based ...
Many sequence analysis problems involve consideration of a multiple sequence alignment where the three-dimensional structure of one (or more) of the aligned sequences is known. In such cases, it is useful to map the sequence variability onto the atomic co-ordinates of the known structure. If the structure also includes a bound ligand (or the location of the active ...
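One way to picture the mapping described in the abstract above is the sketch below. It is an illustration only, not the authors' method: it assumes Shannon entropy as the variability measure, treats "-" as the gap character, and uses hypothetical names (column_entropy, variability_by_residue) and a hypothetical first_resnum parameter for the numbering of the sequence with known structure.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    counts = Counter(c for c in column if c != "-")
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def variability_by_residue(alignment, ref=0, first_resnum=1):
    """Map per-column variability onto residue numbers of the reference.

    alignment: list of equal-length aligned sequences (assumed input form).
    ref: index of the sequence whose structure is known.
    Columns where the reference has a gap have no residue to map onto
    and are skipped.
    """
    scores = {}
    resnum = first_resnum
    for col in zip(*alignment):
        if col[ref] == "-":
            continue
        scores[resnum] = column_entropy(col)
        resnum += 1
    return scores


if __name__ == "__main__":
    aln = ["ACD", "ACE", "A-D"]
    for resnum, score in variability_by_residue(aln).items():
        print(resnum, round(score, 3))
```

The resulting per-residue scores could then be attached to the atoms of each residue, for example to colour a structure display; how that is done depends on the viewer and file format and is left out here.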
Fold recognition methods aim to use the information in the known protein structures (the targets) to identify that the sequence of a protein of unknown structure (the probe) will adopt a known fold. This paper highlights that the structural similarities sought by these methods can be divided into two types: remote homologues and analogues. Homologues are ...
When crystallization screening is conducted, many outcomes are observed, but typically the only trial recorded in the literature is the condition that yielded the crystal(s) used for subsequent diffraction studies. The initial hit that was optimized and the results of all the other trials are lost. These missing results contain information that would be ...
We apply a recently published method of text-based molecular similarity searching (LINGO) to standard data sets for the purpose of quantifying the accuracy of the approach. Our implementation is based on a pattern-matching finite state machine (FSM), which results in fast search times. The accuracy of LINGO is demonstrated to be comparable to that of a ...
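To make the idea of text-based similarity concrete, here is a minimal sketch that compares two SMILES strings by their overlapping substrings. It rests on assumptions not stated in the abstract: a substring length of q = 4, a multiset Tanimoto coefficient as the score, and no SMILES normalization. It uses plain Python counters rather than the finite state machine the authors describe, so it illustrates the scoring idea and not the fast implementation. The function names lingos and lingo_tanimoto are hypothetical.

```python
from collections import Counter


def lingos(smiles, q=4):
    """Count the overlapping q-character substrings of a SMILES string."""
    return Counter(smiles[i:i + q] for i in range(len(smiles) - q + 1))


def lingo_tanimoto(a, b, q=4):
    """Multiset Tanimoto coefficient over the q-character substrings.

    Returns 0.0 when neither string is long enough to yield a substring.
    """
    ca, cb = lingos(a, q), lingos(b, q)
    shared = sum((ca & cb).values())  # multiset intersection (min counts)
    total = sum((ca | cb).values())   # multiset union (max counts)
    return shared / total if total else 0.0


if __name__ == "__main__":
    print(lingo_tanimoto("CCCCO", "CCCCO"))
    print(lingo_tanimoto("CCCCO", "CCCCN"))
```

Because the comparison works on raw strings, the score depends on how the SMILES were written; in practice both molecules would need to be expressed in the same canonical form before comparison.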