High-Precision Extraction of Emerging Concepts from Scientific Literature

  title={High-Precision Extraction of Emerging Concepts from Scientific Literature},
  author={Daniel King and Doug Downey and Daniel S. Weld},
  journal={Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
  • Daniel King, Doug Downey, Daniel S. Weld
  • Published 11 June 2020
  • Computer Science
  • Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
Identification of new concepts in scientific literature can help power faceted search, scientific trend analysis, knowledge-base construction, and more, but current methods are lacking. Manual identification can't keep up with the torrent of new publications, while the precision of existing automatic techniques is too low for many applications. We present an unsupervised concept extraction method for scientific literature that achieves much higher precision than previous work. Our approach… 

The impact of preprint servers in the formation of novel ideas

A Bayesian method to estimate the time of appearance of a phrase in the literature is developed. It shows that most phrases currently appear first in traditional journals, but a number of phrases first appear on preprint servers.

ACCoRD: A Multi-Document Approach to Generating Diverse Descriptions of Scientific Concepts

ACCoRD, an end-to-end system tackling the novel task of generating sets of descriptions of scientific concepts, is presented, and a user study demonstrates that users prefer descriptions produced by the system and prefer multiple descriptions to a single “best” description.

Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search

PINOCCHIO is presented, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations.

Metrics and Mechanisms: Measuring the Unmeasurable in the Science of Science

Towards Personalized Descriptions of Scientific Concepts

This paper proposes generating personalized scientific concept descriptions tailored to the user’s expertise and context, outlines a complete architecture for the task, and releases an expert-annotated resource, ACCoRD.

README: A Literature Survey Assistant

Literature review is an integral element of academic research, enabling researchers to learn about and build on existing work. Traditionally, this involves manually going through various published



TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications

An iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type, is presented.

Extracting Keyphrases from Research Papers Using Citation Networks

This work proposes CiteTextRank for keyphrase extraction from research articles, a graph-based algorithm that incorporates evidence from both a document's content and the contexts in which the document is referenced within a citation network.

A frequent keyword-set based algorithm for topic modeling and clustering of research papers

A novel and efficient approach to detect topics in a large corpus of research papers, using closed frequent keyword-sets to form topics and a modified PageRank algorithm that assigns an authoritative score to each research paper by considering the sub-graph in which the research paper appears.

Construction of the Literature Graph in Semantic Scholar

This paper reduces literature graph construction into familiar NLP tasks, points out research challenges due to differences from standard formulations of these tasks, and reports empirical results for each task.

Detecting research topics via the correlation between graphs and texts

This paper presents a unique approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph where the nodes are limited to the documents containing the term.

Phrases as subtopical concepts in scholarly text

This work presents a method to extract "phrase" phrases from a text corpus, and rank them using a citation network measure, the compensated normalized link count (CNLC), which measures the extent to which they are propagated along the citation structure of articles.

A review of keyphrase extraction

This article introduces keyphrase extraction, provides a well‐structured review of the existing work, offers interesting insights on the different evaluation approaches, highlights open issues and presents a comparative experimental study of popular unsupervised techniques on five datasets.

Self-taught hashing for fast similarity search

This paper proposes a novel Self-Taught Hashing (STH) approach to semantic hashing: it first finds the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then trains l classifiers via supervised learning to predict the l-bit code for any previously unseen query document.

Latent Dirichlet Allocation

Bursty and Hierarchical Structure in Streams

  • J. Kleinberg
  • Computer Science
  • Data Mining and Knowledge Discovery
  • 2004
The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content.