Image-based automated chemical database annotation with ensemble of machine-vision classifiers
We present an application of a novel methodology, Text Influenced Molecular Indexing (TIMI), to mining information from the scientific literature. TIMI extends two existing methodologies: (1) Latent Semantic Structure Indexing (LaSSI), a method for calculating chemical similarity from two-dimensional topological descriptors, and (2) Latent Semantic Indexing (LSI), a method for generating correlations between textual terms. The singular value decomposition (SVD) of a feature/object matrix is the fundamental mathematical operation underlying LSI, LaSSI, and TIMI, and it is used to identify associations between textual and chemical descriptors. Results from a database of 11,571 PubMed/MEDLINE abstracts show the advantages of merging textual and chemical descriptors over using either text or chemistry alone. Our work demonstrates that searching a text-only database limits the retrieved documents to those that explicitly mention compounds by name. Similarly, searching a chemistry-only database retrieves only those documents that contain chemical structures. TIMI, however, enables search and retrieval of documents using textual queries, chemical queries, or queries that combine both. Thus, the TIMI system offers a powerful new approach to uncovering the contextual scientific knowledge sought by the medical research community.
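The role of the SVD described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy matrix, the feature names (`frag_X`, `frag_Y`), the function names, and the choice of `k = 2` are all assumptions made for illustration. Rows are features (textual terms and hypothetical chemical descriptors merged into one matrix), and columns are documents. A chemistry-only query is folded into the truncated latent space and compared with each document by cosine similarity.

```python
import numpy as np

def build_latent_space(matrix, k):
    """Truncated SVD of a feature-by-object matrix (illustrative helper)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    u_k, s_k = u[:, :k], s[:k]
    # Row j holds document j's coordinates in the k-dimensional latent space.
    doc_coords = vt[:k, :].T * s_k
    return u_k, s_k, doc_coords

def project_query(query, u_k):
    """Fold a feature-space query vector into the latent space."""
    return query @ u_k

def cosine_rank(query_coords, doc_coords):
    """Return document indices sorted by cosine similarity, plus the scores."""
    norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(query_coords)
    sims = (doc_coords @ query_coords) / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims), sims

# Hypothetical toy data: three textual terms and two chemical descriptors.
features = ["kinase", "inhibitor", "pain", "frag_X", "frag_Y"]
# Columns: d0 (text + structure), d1 (text only), d2 (text + structure), d3 (text only)
matrix = np.array([
    [1, 1, 0, 0],  # kinase
    [1, 1, 0, 0],  # inhibitor
    [0, 0, 1, 1],  # pain
    [1, 0, 0, 0],  # frag_X
    [0, 0, 1, 0],  # frag_Y
], dtype=float)

u_k, s_k, doc_coords = build_latent_space(matrix, k=2)

# Chemistry-only query: the single hypothetical descriptor frag_X.
query = np.zeros(len(features))
query[features.index("frag_X")] = 1.0

order, sims = cosine_rank(project_query(query, u_k), doc_coords)
raw_overlap = query @ matrix  # direct feature matching, without the SVD

for j in order:
    print(f"d{j}: latent similarity {sims[j]:.3f}, raw overlap {raw_overlap[j]:.0f}")
```

In this toy setup, document d1 contains no chemical descriptor, so direct feature matching gives it zero overlap with the query. In the latent space it scores alongside d0, because it shares the textual terms that co-occur with `frag_X`. That is the kind of text/chemistry association the merged matrix is meant to expose.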