High-Precision Extraction of Emerging Concepts from Scientific Literature

  • Daniel King, Doug Downey, Daniel S. Weld
  • Published 11 June 2020
  • Computer Science
  • Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
Identification of new concepts in scientific literature can help power faceted search, scientific trend analysis, knowledge-base construction, and more, but current methods are lacking. Manual identification can't keep up with the torrent of new publications, while the precision of existing automatic techniques is too low for many applications. We present an unsupervised concept extraction method for scientific literature that achieves much higher precision than previous work. Our approach… 

The impact of preprint servers in the formation of novel ideas
A Bayesian method for estimating when a phrase first appears in the literature is developed. It shows that most phrases currently appear first in traditional journals, but a number of phrases make their first appearance on preprint servers.
ACCoRD: A Multi-Document Approach to Generating Diverse Descriptions of Scientific Concepts
ACCoRD, an end-to-end system tackling the novel task of generating sets of descriptions of scientific concepts, is presented. A user study demonstrates that users prefer descriptions produced by the system, and that they prefer multiple descriptions to a single “best” description.
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search
PINOCCHIO is presented, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations.
Metrics and Mechanisms: Measuring the Unmeasurable in the Science of Science
Towards Personalized Descriptions of Scientific Concepts
This paper proposes generating personalized scientific concept descriptions tailored to the user’s expertise and context, outlines a complete architecture for the task, and releases an expert-annotated resource, ACCoRD.
README: A Literature Survey Assistant
Literature review is an integral element of academic research, enabling researchers to learn about and build on existing work. Traditionally, this involves manually going through various published


TSE-NER: An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications
An iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type, is presented.
Extracting Keyphrases from Research Papers Using Citation Networks
This work proposes CiteTextRank for keyphrase extraction from research articles, a graph-based algorithm that incorporates evidence from both a document's content as well as the contexts in which the document is referenced within a citation network.
A frequent keyword-set based algorithm for topic modeling and clustering of research papers
A novel and efficient approach to detect topics in a large corpus of research papers using closed frequent keyword-set to form topics and a modified PageRank algorithm that assigns an authoritative score to each research paper by considering the sub-graph in which the research paper appears.
Construction of the Literature Graph in Semantic Scholar
This paper reduces literature graph construction to familiar NLP tasks, points out research challenges due to differences from standard formulations of these tasks, and reports empirical results for each task.
Detecting research topics via the correlation between graphs and texts
This paper presents a unique approach that uses the correlation between the distribution of a term that represents a topic and the link distribution in the citation graph where the nodes are limited to the documents containing the term.
Phrases as subtopical concepts in scholarly text
This work presents a method to extract "phrase" phrases from a text corpus, and rank them using a citation network measure, the compensated normalized link count (CNLC), which measures the extent to which they are propagated along the citation structure of articles.
A review of keyphrase extraction
This article introduces keyphrase extraction, provides a well‐structured review of the existing work, offers interesting insights on the different evaluation approaches, highlights open issues and presents a comparative experimental study of popular unsupervised techniques on five datasets.
Self-taught hashing for fast similarity search
This paper proposes a novel Self-Taught Hashing (STH) approach to semantic hashing: it first finds the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then trains l classifiers via supervised learning to predict the l-bit code for any previously unseen query document.
Latent Dirichlet Allocation
Deep Contextualized Word Representations
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.