Construction of the Literature Graph in Semantic Scholar

Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu A. Ha, Rodney Michael Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler C. Murray, Hsu-Han Ooi, Matthew E. Peters, Joanna L. Power, Sam Skjonsberg, Lucy Lu Wang, Christopher Wilhelm, Zheng Yuan, Madeleine van Zuylen, and Oren Etzioni
We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to… 

GrapAL: Querying Semantic Scholar's Literature Graph

This work describes the basic elements of GrapAL, how to use it, and several use cases, such as finding experts on a given topic for peer reviewing, discovering indirect connections between biomedical entities, and computing citation-based metrics.

Triple Classification for Scholarly Knowledge Graph Completion

This work presents exBERT, a method for leveraging pre-trained transformer language models to perform scholarly knowledge graph completion, and presents two scholarly datasets as resources for the research community, collected from public KGs and online resources.

End-to-End NLP Knowledge Graph Construction

This paper applies the SciNLP-KG framework to 30,000 NLP papers from the ACL Anthology to build a large-scale KG that can facilitate the automatic construction of scientific leaderboards for the NLP community, and reports that the resulting KG contains high-quality information.

Open Information Extraction for Knowledge Graph Construction

The proposed OIE4KGC approach takes a document corpus and identifies triples within this corpus which are then processed to generate a literature knowledge graph.

Scalable, Semi-Supervised Extraction of Structured Information from Scientific Literature

A novel, scalable, semi-supervised method is proposed for extracting relevant structured information from the vast body of raw scientific literature: it extracts the fundamental concepts of “aim”, “method” and “result” from scientific articles and uses them to construct a knowledge graph.

From Books to Knowledge Graphs

A bottom-up approach to support publishers in creating and maintaining their own publication knowledge graphs in the open domain is proposed by releasing a pipeline able to extract structured information from the bibliographies and indexes of AHSS publications, disambiguate, normalize and export it as linked data.

GrapAL: Connecting the Dots in Scientific Literature

This work describes the basic elements of GrapAL, how to use it, and several use cases, such as finding experts on a given topic for peer reviewing, discovering indirect connections between biomedical entities, and computing citation-based metrics.

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links and supports construction of a scientific knowledge graph, which is used to analyze information in scientific literature.

Improving Access to Scientific Literature with Knowledge Graphs

A scholarly knowledge graph can be used to give a condensed overview on the state-of-the-art addressing a particular research quest, for example as a tabular comparison of contributions according to various characteristics of the approaches.

GORC: A large contextual citation graph of academic papers

We introduce the Semantic Scholar Graph of References in Context (GORC), a large contextual citation graph of 81.1M academic publications, including parsed full text for 8.1M open access papers,

Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding

Explicit Semantic Ranking is introduced, a new ranking technique that leverages knowledge graph embeddings: it represents queries and documents in the entity space and ranks them based on their semantic connections in the knowledge graph embedding.

Swanson linking revisited: Accelerating literature-based discovery across domains using a conceptual influence graph

It is demonstrated that this deep reading and search system reduces the effort needed to uncover “undiscovered public knowledge”, and that with the aid of this tool a domain expert was able to drastically reduce her model building time from months to two days.

TabEL: Entity Linking in Web Tables

TabEL differs from previous work by weakening the assumption that the semantics of a table can be mapped to pre-defined types and relations found in the target KB, and enforces soft constraints in the form of a graphical model that assigns higher likelihood to sets of entities that tend to co-occur in Wikipedia documents and tables.

Content-Based Citation Recommendation

It is shown empirically that, although adding metadata improves performance on standard metrics, it favors self-citations, which are less useful in a citation recommendation setup; an online portal for citation recommendation based on this method is also released.

TAGME: on-the-fly annotation of short text fragments (by wikipedia entities)

We designed and implemented TAGME, a system that is able to efficiently and judiciously augment a plain-text with pertinent hyperlinks to Wikipedia pages. The specialty of TAGME with respect to known

Design Challenges for Entity Linking

This work analyzes differences between several versions of the EL problem, presents a simple yet effective, modular, unsupervised system for entity linking called Vinculum, and elucidates key aspects of the system, including mention extraction, candidate generation, entity type prediction, entity coreference, and coherence.

Identifying Meaningful Citations

This work introduces the novel task of identifying important citations in scholarly literature, i.e., citations that indicate that the cited work is used or extended in the new effort, and proposes a supervised classification approach that addresses this task with a battery of features.

CiteSeerX: AI in a Digital Library Search Engine

This work presents key AI technologies used in the following components of CiteSeerX: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation.

SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications

We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and

Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation

A method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE is presented and allows the evaluation of WSD algorithms in the biomedical domain.