Rare Disease Identification from Clinical Notes with Ontologies and Weak Supervision

Hang Dong, Víctor Suárez-Paniagua, Huayu Zhang, Minhong Wang, Emma Whitfield, Honghan Wu
2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)
The identification of rare diseases from clinical notes with Natural Language Processing (NLP) is challenging due to the few cases available for machine learning and the need for data annotation by clinical experts. We propose a method using ontologies and weak supervision. The approach includes two steps: (i) Text-to-UMLS, linking text mentions to concepts in the Unified Medical Language System (UMLS), with a named entity linking tool (e.g. SemEHR) and weak supervision based on customised rules…

Ontology-Based and Weakly Supervised Rare Disease Phenotyping from Clinical Notes
A weakly supervised approach is proposed to learn a phenotype confirmation model that improves Text-to-UMLS linking without annotated data from domain experts, and the usefulness of the weak supervision approach is discussed.
Automated Clinical Coding: What, Why, and Where We Are?
The idea of automated clinical coding is introduced and its challenges are summarized from the perspective of Artificial Intelligence (AI) and Natural Language Processing (NLP), based on the literature, the project experience over the past two and a half years, and discussions with clinical coding experts in Scotland and the UK.
Deep Learning for Rare Disease: A Scoping Review
This study reviewed the current uses of deep learning to advance rare disease research and found that deep learning has been actively used for rare neoplastic diseases, followed by rare genetic diseases and rare neurological diseases, and convolutional neural networks were the most frequently used deep learning architecture.
Using Symptoms and Healthcare Encounters to Capture a Rare Disease: A Study of Clinical Notes of the Alpha‐Gal Meat Allergy
Yuanye Ma, M. Flaherty. Proceedings of the Association for Information Science and Technology, 2021.
This in‐depth analysis of clinical notes of AGS can serve as a basis for future automation of rare disease analysis and provides a basic understanding of the granularity of information that an electronic health record (EHR) may provide for rare disease identification.


An Ontology-Based Approach to Estimate the Frequency of Rare Diseases in Narrative-Text Radiology Reports
Automated ontology-based search of radiology reports can estimate the frequency of rare diseases, and those diseases with higher known prevalence were significantly more likely to appear in radiology reports.
Bio-YODIE: A Named Entity Linking System for Biomedical Text
This work presents a new system, Bio-YODIE, and compares it to two other popular systems in order to give guidance about suitable approaches in different scenarios and how systems might be designed to accommodate future needs.
Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets
The Biomedical Language Understanding Evaluation (BLUE) benchmark is introduced to facilitate research in the development of pre-training language representations in the biomedicine domain and it is found that the BERT model pre-trained on PubMed abstracts and MIMIC-III clinical notes achieves the best results.
Few-Shot and Zero-Shot Multi-Label Learning for Structured Label Spaces
This paper develops few- and zero-shot methods for multi-label text classification when there is a known structure over the label space, and evaluates them on two publicly available medical text datasets: MIMIC-II and MIMIC-III.
[Orphanet: a European database for rare diseases].
The database can be accessed through the website (www.orpha.net) and has some interesting options for searching, for example research projects, support groups or searching by clinical signs.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
International classification of diseases.
The leading causes of death are determined using a specific tabulation list and rules for ranking. In ICD-10, the 113 cause list is used for ranking except when ranking infant causes separately…
Scikit-learn: Machine Learning in Python
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing…