Document Classification for COVID-19 Literature

Bernal Jimenez Gutierrez, Juncheng Zeng, Dongdong Zhang, Ping Zhang, Yu Su
The global pandemic has made it more important than ever to quickly and accurately retrieve relevant scientific literature for effective consumption by researchers in a wide range of fields. We provide an analysis of several multi-label document classification models on the LitCovid dataset. We find that pre-trained language models outperform other models in both low and high data regimes, achieving a maximum F1 score of around 86%. We note that even the highest performing models still struggle… 

Repurposing TREC-COVID Annotations to Answer the Key Questions of CORD-19

This work repurposes the relevancy annotations for TREC-COVID tasks to identify journal articles in CORD-19 that are relevant to the key questions posed by CORD-19, and presents the methodology used to construct the new dataset.

Annotating the Pandemic: Named Entity Recognition and Normalisation in COVID-19 Literature

A publicly available pipeline to perform named entity recognition and normalisation in parallel to help find relevant publications and to aid in downstream NLP tasks such as text summarisation is presented.

Answering Questions on COVID-19 in Real-Time

This work outlines CovidAsk, a question answering (QA) system that combines biomedical text mining and QA techniques to answer questions in real time, leveraging both supervised and unsupervised approaches to provide informative answers.

“The coronavirus is a bioweapon”: classifying coronavirus stories on fact-checking sites

This work characterises stories reported by fact-checking groups PolitiFact, Poynter and Snopes from January to June 2020, groups these stories into six clusters, and analyses temporal trends of story validity and the level of agreement across sites.

Harmonic Means between TF-IDF and Angle of Similarity to Identify Prospective Applicants in a Recruitment Setting

A combination of angle of similarity and term frequency–inverse document frequency is proposed to classify prospective job applicants, and it is concluded that harmonic similarity is a viable way to combine the two models.

A Comparison of Multi-Label Text Classification Models in Research Articles Labeled With Sustainable Development Goals

This article compares the performance of multi-label text classification models based on a proposed framework with datasets of different characteristics and shows that the combination of Label Powerset with Support Vector Machine can achieve an accuracy of up to 87% for an imbalanced dataset, 83% for a dataset with the same number of instances per label, and even 91% for a multiclass dataset.

LitMC-BERT: Transformer-Based Multi-Label Classification of Biomedical Literature With An Application on COVID-19 Literature Curation

This study proposes LitMC-BERT, a transformer-based multi-label classification method for biomedical literature that uses a shared transformer backbone for all the labels while also capturing label-specific features and the correlations between label pairs.

International Journal of Electrical and Computer Engineering (IJECE)

Among the three models, the fine-tuned VGG-16 model was found to perform best, attaining a very high accuracy on the dataset, while the convolutional neural network (CNN) and AlexNet models attained high accuracies.

Do We Need a Specific Corpus and Multiple High-Performance GPUs for Training the BERT Model? An Experiment on COVID-19 Dataset

This work presents a method for building an unsupervised zero-shot classification model based on the pre-trained BERT model; it reaches an accuracy of 27.84%, which is 6.73% lower than the best-achieved accuracy but still comparable.

Identifying and Characterizing Active Citizens who Refute Misinformation in Social Media

This paper develops and makes publicly available a new dataset of Weibo users mapped into one of the two categories (i.e., misinformation posters or active citizens), and presents an extensive analysis of the differences in language use between the two user categories.



CORD-19: The Covid-19 Open Research Dataset

The mechanics of dataset construction are described, highlighting challenges and key design decisions, along with an overview of how CORD-19 has been used and several shared tasks built around the dataset.

An effective biomedical document classification scheme in support of biocuration: addressing class imbalance

This work presents an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating, based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies.

ML-Net: multi-label classification of biomedical texts with deep neural networks

OBJECTIVE: In multi-label text classification, each textual document is assigned 1 or more labels. As an important task that has broad applications in biomedicine, a number of different computational…

Rethinking Complex Neural Network Architectures for Document Classification

In a large-scale reproducibility study of several recent neural models, it is found that a simple BiLSTM architecture with appropriate regularization yields accuracy and F1 that are either competitive with or exceed the state of the art on four standard benchmark datasets.

Convolutional neural networks for biomedical text classification: application in indexing biomedical articles

This paper uses convolutional neural networks to build binary text classifiers and achieves an absolute improvement of over 3% in macro F-score over a set of selected hard-to-classify MeSH terms when compared with the best prior results on a public dataset.

Deep Learning for Extreme Multi-label Text Classification

This paper presents the first attempt at applying deep learning to XMTC, with a family of new Convolutional Neural Network models which are tailored for multi-label classification in particular.

Automatic categorization of diverse experimental information in the bioscience literature

An automatic method for identifying papers containing these curation data types among a large pool of published scientific papers, based on the machine learning method Support Vector Machine (SVM), is developed and can be readily incorporated into different workflows at different literature-based databases.

Automatic semantic classification of scientific literature according to the hallmarks of cancer

This work introduces a corpus of 1499 PubMed abstracts annotated according to the scientific evidence they provide for the 10 currently known hallmarks of cancer, and uses this corpus to train a system that classifies PubMed literature according to the hallmarks.

Phenotyping of Clinical Notes with Improved Document Classification Models Using Contextualized Neural Language Models

Several architectures for modeling phenotyping that rely solely on BERT representations of the clinical note are explored, finding these architectures are competitive with or outperform existing state-of-the-art methods on two phenotyping tasks.

DocBERT: BERT for Document Classification

It is shown that a straightforward classification model using BERT is able to achieve the state of the art across four popular datasets, and knowledge is distilled from BERT-large to small bidirectional LSTMs, reaching BERT-base parity on multiple datasets using 30x fewer parameters.