HighSim: Highly Effective Similarity Measurement in Large Heterogeneous Information Networks
BACKGROUND As the volume of biomedical text increases exponentially, automatic indexing becomes increasingly important. However, existing approaches do not distinguish central (or core) concepts from concepts that were mentioned in passing. We focus on the problem of indexing MEDLINE records, a process that is currently performed by highly trained humans at the National Library of Medicine (NLM). NLM indexers are assisted by a system called the Medical Text Indexer (MTI) that suggests candidate indexing terms. OBJECTIVE To improve the ability of MTI to select the core terms in MEDLINE abstracts. These core concepts are deemed to be most important and are designated as "major headings" by MEDLINE indexers. We introduce and evaluate a graph-based indexing methodology called MEDRank that generates concept graphs from biomedical text and then ranks the concepts within these graphs to identify the most important ones. METHODS We insert a MEDRank step into the MTI and compare MTI's output with and without MEDRank to the MEDLINE indexers' selected terms for a sample of 11,803 PubMed Central articles. We also tested whether human raters prefer terms generated by the MEDLINE indexers, MTI without MEDRank, and MTI with MEDRank for a sample of 36 PubMed Central articles. RESULTS MEDRank improved recall of major headings designated by 30% over MTI without MEDRank (0.489 vs. 0.376). Overall recall was only slightly (6.5%) higher (0.490 vs. 0.460) as was F(2) (3%, 0.408 vs. 0.396). However, overall precision was 3.9% lower (0.268 vs. 0.279). Human raters preferred terms generated by MTI with MEDRank over terms generated by MTI without MEDRank (by an average of 1.00 more term per article), and preferred terms generated by MTI with MEDRank and the MEDLINE indexers at the same rate. CONCLUSIONS The addition of MEDRank to MTI significantly improved the retrieval of core concepts in MEDLINE abstracts and more closely matched human expectations compared to MTI without MEDRank. In addition, MEDRank slightly improved overall recall and F(2).