PubMed Phrases, an open set of coherent phrases for searching biomedical literature

Sun Kim, Lana Yeganova, Donald C. Comeau, W. John Wilbur, Zhiyong Lu
Scientific Data

In biomedicine, key concepts are often expressed by multiple words (e.g., ‘zinc finger protein’). Previous work has shown treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the… 

PMCVec: Distributed phrase representation for biomedical text processing

A novel MEDLINE topic indexing method using image presentation

PubMed Author-assigned Keyword Extraction (PubMedAKE) Benchmark

Experimental results using state-of-the-art baseline methods illustrate the need for developing automatic keyword extraction methods for biomedical literature.

MeSH-based dataset for measuring the relevance of text retrieval

This work selects a suitable subset of MeSH terms as queries, utilizes MeSH term assignments as pseudo-relevance rankings for retrieval evaluation, and uses the proposed retrieval evaluation framework to better understand how to combine heterogeneous sources of textual information.

Robust Representation Learning of Biomedical Names

The idea behind the approach is to consider and encode contextual meaning, conceptual meaning, and the similarity between synonyms during the representation learning process, resulting in high practical utility in real-world applications.

A reference set of curated biomedical data and metadata from clinical case reports

A standardized metadata template and MACCR set are developed that render CCRs more findable, accessible, interoperable, and reusable while serving as valuable resources for key user groups, including researchers, physician investigators, clinicians, data scientists, and those shaping government policies for clinical trials.

A graph-based method for reconstructing entities from coordination ellipsis in medical text

RECEEM improves concept normalization for medical coordinated elliptical expressions across a variety of biomedical corpora; in the evaluation, it outperformed existing methods and significantly enhanced the performance of 2 notable NLP systems for mapping coordination ellipses.

Fast searches of large collections of single cell data using scfind

Using transcriptome data from mouse cell atlases, scfind can be used to evaluate marker genes, to perform in silico gating, and to identify both cell-type specific and housekeeping genes, and a subquery optimization routine is developed to ensure that long and complex queries return meaningful results.

Epione application: An integrated web-toolkit of clinical genomics and personalized medicine in systemic lupus erythematosus

The Epione application is presented: an integrated bioinformatics web-toolkit designed to assist medical experts and researchers in more accurately diagnosing SLE. It may facilitate early-stage diagnosis by comparing the patient's genomic profile against the list of the most predictable candidate gene variants related to SLE.



How to interpret PubMed queries and why it matters

An automated retrieval evaluation method, based on machine learning techniques, is developed to evaluate and compare various retrieval outcomes. It shows that the class of records that contain all the search terms, but not the phrase, qualitatively differs from the class of records containing the phrase.

Summarizing Topical Contents from PubMed Documents Using a Thematic Analysis

A method is proposed that finds sub-topics, referred to as themes, and computes representative titles based on the set of documents in each theme; it outperformed both LDA and MeSH® terms.

Extracting noun phrases for all of MEDLINE

The extraction of noun phrases from MEDLINE is discussed, using a general parser not tuned specifically for any medical domain, and it is claimed that a generic parser can effectively extract all the different phrases across the entire medical literature.

Corpus-based statistical screening for phrase identification.

  • W. Kim, W. Wilbur
  • Computer Science
    Journal of the American Medical Informatics Association : JAMIA
  • 2000
Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.

Understanding PubMed® user search behavior through log analysis

This investigation was conducted through the analysis of one month of log data, consisting of more than 23 million user sessions and more than 58 million user queries, which provided insight into PubMed users’ needs and their behavior.

Meshable: searching PubMed abstracts by utilizing MeSH and MeSH-derived topical terms

A web interface is introduced which allows users to enter queries to find MeSH terms closely related to the queries and can be effectively used to find full names of abbreviations and to disambiguate user queries.

Retro: concept-based clustering of biomedical topical sets

Retro is a novel clustering algorithm that extracts meaningful clusters along with concise and descriptive titles from small and homogeneous document collections, and is superior to existing methods in terms of cluster quality.

Click-words: learning to predict document keywords from a user perspective

This model is able to accurately predict the words likely to appear in user queries that lead to document clicks, and suggests that click-words tend to be biomedical entities, to exist in article titles, and to occur repeatedly in article abstracts.