Corpus ID: 216056360

CORD-19: The COVID-19 Open Research Dataset

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Michael Kinney, Ziyang Liu, William Cooper Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D Wade, Kuansan Wang, Christopher Wilhelm, Boya Xie, Douglas A. Raymond, Daniel S. Weld, Oren Etzioni, Sebastian Kohlmeier

The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus research. CORD-19 is designed to facilitate the development of text mining and information retrieval systems over its rich collection of metadata and structured full text papers. Since its release, CORD-19 has been downloaded over 200K times and has served as the basis of many COVID-19 text mining and discovery systems. In this article, we describe the mechanics… 

COVIDSeer: Extending the CORD-19 Dataset

An enhanced version of the CORD-19 dataset is developed, and a vertical search engine, COVIDSeer, is built on the new dataset. It offers keyphrase-enhanced search and will hopefully help biomedical and life science researchers, medical students, and the general public to explore coronavirus-related literature more effectively.

COVID19 Drug Repository: text-mining the literature in search of putative COVID19 therapeutics

The COVID19 Drug Repository enables users to focus on different levels of complexity, starting from general information about (FDA-) approved drugs, PubMed references, clinical trials, recipes as well as the descriptions of molecular mechanisms of drugs’ action.

A scientometric overview of CORD-19

Based on a comparison to the Web of Science database, it is found that CORD-19 provides an almost complete coverage of research on COVID-19 and coronaviruses.

Repurposing TREC-COVID Annotations to Answer the Key Questions of CORD-19

This work repurposes the relevance annotations for TREC-COVID tasks to identify journal articles in CORD-19 that are relevant to the key questions posed by CORD-19, and presents the methodology used to construct the new dataset.

Covid-on-the-Web: Knowledge Graph and Services to Advance COVID-19 Research

The Covid-on-the-Web project aims to allow biomedical researchers to access, query and make sense of COVID-19 related literature, and adapts, combines and extends tools to process, analyze and enrich the "COVID-19 Open Research Dataset" (CORD-19).

Using Machine Learning Algorithms for Finding the Topics of COVID-19 Open Research Dataset Automatically

The topic modeling pipeline presented in this thesis helps researchers gain an overview of the topics addressed in the papers of COVID-19, SARS-CoV-2, and related coronaviruses curated by the Allen Institute for AI.

LitCovid: an open database of COVID-19 literature

LitCovid is the first-of-its-kind COVID-19-specific literature resource, with all of its collected articles and curated data freely available, and has been widely used, with millions of accesses by users worldwide for various information needs.

COVIDSeer : Filling missing pieces in the CORD-19 dataset

An enhanced version of the CORD-19 dataset is developed, and a vertical search engine, COVIDSeer, is built on the new dataset. It offers keyphrase-enhanced search and will hopefully help biomedical and life science researchers, medical students, and the general public to explore coronavirus-related literature more effectively.

SAPGraph: Structure-aware Extractive Summarization for Scientific Papers with Heterogeneous Graph

SAPGraph is an extractive summarization framework for scientific papers based on a structure-aware heterogeneous graph, which models the document as a graph with three kinds of nodes and edges based on structure information of facets and knowledge.



Information Mining for COVID-19 Research From a Large Volume of Scientific Literature

A graph-based model is developed using abstracts of 10,683 scientific articles to find key information on three topics: transmission, drug types, and genome research related to coronavirus to expedite and recommend new and alternative directions for COVID-19 research.

COVID-19 and Inflammatory Bowel Diseases: Risk Assessment, Shared Molecular Pathways, and Therapeutic Challenges

Using current understanding of the immunopathology of SARS-CoV-2 and other pathogenic coronaviruses, it is shown why IBD patients should not be considered at an increased risk of infection or more severe outcomes.

Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision

This CORD-NER dataset with comprehensive named entity recognition (NER) on the COVID-19 Open Research Dataset Challenge (CORD-19) corpus covers 75 fine-grained entity types, which may benefit research on COVID-19 related virus, spreading mechanisms, and potential vaccines.

Identifying Radiological Findings Related to COVID-19 from Medical Literature

This work develops natural language processing methods to analyze a large collection of COVID-19 literature containing study reports from hospitals all over the world, reconcile these results, and draw unbiased and universally-sensible conclusions about the correlation between radiological findings and COVID-19.

TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection

TREC-COVID is a community evaluation designed to build a test collection that captures the information needs of biomedical researchers using the scientific literature during a pandemic. One of the

Tuberculosis and COVID-19 in 2020: lessons from the past viral outbreaks and possible future outcomes

Investigating the pathological pathways linking TB and SARS-CoV-2 leads to the idea that their coexistence might yield a more severe clinical evolution, and the issues of vaccination and diagnostic reliability in the context of coinfection are addressed.

TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19

TREC-COVID differs from traditional IR shared task evaluations with special considerations for the expected users, IR modality considerations, topic development, participant requirements, assessment process, relevance criteria, evaluation metrics, iteration process, projected timeline, and the implications of data use as a post-task test collection.

Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset

The Neural Covidex is a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset (CORD-19) curated by the Allen

Exploring the SARS-CoV-2 virus-host-drug interactome for drug repurposing

CoVex renders COVID-19 drug research systems-medicine-ready by giving the scientific community direct access to network medicine algorithms and investigates recent hypotheses on a systems biology level to explore mechanistic virus life cycle drivers, and to extract drug repurposing candidates.

PMC text mining subset in BioC: about three million full-text articles and growing

To facilitate automated processing of nearly 3 million full-text articles (in the PMC Open Access and Author Manuscript subsets) and to improve interoperability, these articles are converted to BioC, a community-driven simple data structure in either XML or JSON format.