Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions

Dongyeop Kang, Andrew Head, Risham Sidhu, Kyle Lo, Daniel S. Weld, Marti A. Hearst
The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. Despite prior work on definition detection, current approaches are far from accurate enough to use in real-world applications. In this paper, we first perform an in-depth error analysis of the current best-performing definition detection system and discover major causes of errors. Based on this analysis, we develop a new definition…

What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation
A new deep learning model is proposed which utilizes the syntactical structure of the sentence to expand an ambiguous acronym in a sentence and outperforms the state-of-the-art models on the new AD dataset, providing a strong baseline for future research on this dataset.
Utilizing Text Structure for Information Extraction
Information Extraction (IE) is one of the important fields of natural language processing (NLP), with the primary goal of creating structured knowledge from unstructured text. In more than two…
AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21
This paper presents an adversarial training BERT method named AT-BERT, the winning solution to the acronym identification task of the Scientific Document Understanding (SDU) Challenge at AAAI 2021. AT-BERT incorporates the FGM adversarial training strategy into the fine-tuning of BERT, making the model more robust and generalizable.
NaturalProofs: Mathematical Theorem Proving in Natural Language
This work develops NATURALPROOFS, a large-scale dataset of mathematical statements and their proofs, written in natural mathematical language, and proposes a mathematical reference retrieval task that tests a system’s ability to determine the key results that appear in a proof.
Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols
This work introduces ScholarPhi, an augmented reading interface with four novel features: tooltips that surface position-sensitive definitions from elsewhere in a paper, a filter over the paper that “declutters” it to reveal how the term or symbol is used across the paper, automatic equation diagrams that expose multiple definitions in parallel, and an automatically generated glossary of important terms and symbols.
Modeling Mathematical Notation Semantics in Academic Papers
  • Hwiyeol Jo, Dongyeop Kang
  • 2021
Natural language models often fall short when understanding and generating mathematical notation. What is not clear is whether these shortcomings are due to fundamental limitations of the models, or…
Automated Discovery of Mathematical Definitions in Text
This paper investigates automatic detection of one-sentence definitions in mathematical texts, which are difficult to separate from surrounding text, and applies deep learning methods such as convolutional and recurrent neural networks to identify mathematical definitions.
A Joint Model for Definition Extraction with Syntactic Connection and Semantic Consistency
This work proposes a novel model for definition extraction (DE) that simultaneously performs the two tasks in a single framework to benefit from their inter-dependencies, and presents a multi-task learning framework that employs graph convolutional neural networks and predicts the dependency paths between the terms and the definitions.
Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
DefMiner is presented, a supervised sequence labeling system that identifies scientific terms and their accompanying definitions and achieves 85% F1 on a Wikipedia benchmark corpus, significantly improving the previous state-of-the-art by 8%.
Syntactically Aware Neural Architectures for Definition Extraction
This paper presents a set of neural architectures combining Convolutional and Recurrent Neural Networks, which are further enriched by incorporating linguistic information via syntactic dependencies, and demonstrates that models trained on clean Wikipedia-like definitions can successfully be applied to more noisy domain-specific corpora.
Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation
MASI, a distance metric for comparing sets, is discussed, and its use in quantifying the reliability of a specific dataset is illustrated; it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena.
Definition Extraction with LSTM Recurrent Neural Networks
This work models definition extraction as a supervised sequence classification task and proposes a new way to automatically generate sentence features using a Long Short-Term Memory neural network model, which outperforms the current state-of-the-art methods by 5.8%.
Learning Word-Class Lattices for Definition and Hypernym Extraction
This paper proposes Word-Class Lattices, a generalization of word lattices that is applied to the task of definition and hypernym extraction and compares favorably to other pattern generalization methods proposed in the literature.
Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern bootstrapping and deep analysis
A comparative study of two approaches to extracting definitional sentences from a corpus of scholarly discourse, one based on bootstrapping lexico-syntactic patterns and the other based on deep analysis, shows that both methods extract high-quality definition sentences intended for automated glossary construction.
Learning to Identify Definitions using Syntactic Features
An approach to learning concept definitions is presented that operates on fully parsed text; incorporating features referring to the position of the sentence in the document, as well as various syntactic features, gives the best results.
Linguistically-Informed Self-Attention for Semantic Role Labeling
LISA is a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection and SRL, and can incorporate syntax using merely raw tokens as input.