AMMU - A Survey of Transformer-based Biomedical Pretrained Language Models

Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, S. Sangeetha
Journal of Biomedical Informatics
GeMI: interactive interface for transformer-based Genomic Metadata Integration.
Genomic Metadata Integration (GeMI), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments, allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation.
  • 2022
A Comparative Evaluation Of Transformer Models For De-Identification Of Clinical Text Data
Transformer model architectures, after suitable hyper-parameter optimization, offer a satisfactory solution to the clinical text de-identification problem and could be readily adopted in clinical scenarios where clinicians or researchers want to use de-identified clinical text data to facilitate quality improvement and enhanced patient care.
Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT
This work presents a domain-specific adaptation of the A Lite Bidirectional Encoder Representations from Transformers (ALBERT) architecture, trained on biomedical and clinical data and fine-tuned for 6 different tasks across 20 benchmark datasets, and shows that the model is robust and generalizable across common BioNLP tasks.
Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario
This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices, such as masking at word and subword level, varying the vocabulary size and


Publicly Available Clinical BERT Embeddings
This work explores and releases two BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically, and demonstrates that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset.
UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus
By applying a novel knowledge augmentation strategy, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models on common named-entity recognition (NER) and clinical natural language inference tasks.
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
It is shown that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models.
BEHRT: Transformer for Electronic Health Records
This study introduces BEHRT, a deep neural sequence transduction model for electronic health records (EHR) that can simultaneously predict the likelihood of 301 conditions in one’s future visits, and shows a striking improvement over the existing state-of-the-art deep EHR models.
Learning from Unlabelled Data for Clinical Semantic Textual Similarity
  • In Proceedings of the 3rd Clinical Natural Language Processing Workshop (pp. 227-233)
  • November 2020
A pre-training and self-training approach for biomedical named entity recognition.
In NER tasks that focus on common biomedical entity types, such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables a NER model such as a BiLSTM-CRF or BERT to reach performance similar to that of the same model trained on 3x-8x the amount of labeled data.
Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training Using Multi-Task Learning
This study demonstrates that IIT-MTL is an effective way to leverage annotated data from related tasks to improve performance on a target task with a limited data set and opens new avenues of exploration for optimized data set selection to generate more robust and universal contextual representations of text in the clinical domain.
Conceptualized Representation Learning for Chinese Biomedical Text Mining
This paper investigates how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora, proposes a novel conceptualized representation learning approach, and releases a new Chinese Biomedical Language Understanding Evaluation benchmark.
On Adversarial Examples for Biomedical NLP Tasks
This work proposes an adversarial evaluation scheme on two well-known datasets for medical NER and STS, introduces two types of attacks inspired by natural spelling errors and typos made by humans, and finds that the robustness of the models can be improved by training them with adversarial examples.
The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews
The Russian Drug Reaction Corpus (RuDReC) is a new, partially annotated corpus of consumer reviews in Russian about pharmaceutical products, intended for the detection of health-related named entities and of the effectiveness of pharmaceutical products. A baseline model for named entity recognition (NER) and multi-label sentence classification tasks is also presented.