Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions

@article{Kang2020DocumentLevelDD,
  title={Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions},
  author={Dongyeop Kang and Andrew Head and Risham Sidhu and Kyle Lo and Daniel S. Weld and Marti A. Hearst},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.05129}
}
The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. Despite prior work on definition detection, current approaches are far from being accurate enough to use in real-world applications. In this paper, we first perform an in-depth error analysis of the current best-performing definition detection system and discover major causes of errors. Based on this analysis, we develop a new definition…

Citations

Acronym Extraction with Hybrid Strategies

This work first applies pre-trained models to obtain contextualized text encoding, then employs a sequence labeling strategy with BiLSTM and CRF to tag each word in a sentence and adopts adversarial training to improve the robustness and generalization ability of the models.
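
The entry above describes a fairly standard sequence labeling setup. As a rough illustration only, here is a minimal PyTorch sketch of the BiLSTM tagging component with hypothetical dimensions; the pre-trained contextual encoder, CRF layer, and adversarial training that the cited work uses are omitted.

```python
# Minimal sketch of a BiLSTM token tagger (hypothetical dimensions; the
# contextual encoder, CRF layer, and adversarial training are omitted).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256, num_tags=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)           # word -> vector
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                  # contextual encoding
        self.proj = nn.Linear(2 * hidden_dim, num_tags)          # per-token tag scores

    def forward(self, token_ids):
        x = self.embed(token_ids)            # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                  # (batch, seq_len, 2*hidden_dim)
        return self.proj(h)                  # (batch, seq_len, num_tags)

# Toy usage: tag a batch of two 6-token "sentences" with random ids and labels.
model = BiLSTMTagger()
tokens = torch.randint(0, 10000, (2, 6))
logits = model(tokens)                       # (2, 6, 5) unnormalized tag scores
loss = nn.CrossEntropyLoss()(logits.view(-1, 5), torch.randint(0, 5, (12,)))
loss.backward()                              # gradients for a training step
```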

ACCoRD: A Multi-Document Approach to Generating Diverse Descriptions of Scientific Concepts

ACCoRD, an end-to-end system tackling the novel task of generating sets of descriptions of scientific concepts, is presented, along with a user study demonstrating that users prefer descriptions produced by the system and prefer multiple descriptions to a single “best” description.

Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols

This work introduces ScholarPhi, an augmented reading interface with four novel features: tooltips that surface position-sensitive definitions from elsewhere in a paper, a filter over the paper that “declutters” it to reveal how the term or symbol is used across the paper, automatic equation diagrams that expose multiple definitions in parallel, and an automatically generated glossary of important terms and symbols.

CDM: Combining Extraction and Generation for Definition Modeling

This paper proposes to combine extraction and generation for definition modeling: first extract self- and correlative definitional information of target terms from the Web, and then generate the final definitions by incorporating the extracted definitional information.

What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation

A new deep learning model is proposed which utilizes the syntactical structure of the sentence to expand an ambiguous acronym in a sentence and outperforms the state-of-the-art models on the new AD dataset, providing a strong baseline for future research on this dataset.

Modeling Mathematical Notation Semantics in Academic Papers

The extent to which natural language models can learn the semantics linking mathematical notation and its surrounding text is explored, and a model that selectively masks notation tokens and encodes the left and/or right sentences as context is trained.
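
The masking idea can be illustrated with a toy snippet; the notation detector below is a naive placeholder of my own, not the cited model's tokenizer or notation detection method.

```python
# Toy illustration of selectively masking notation tokens before encoding.
# The heuristic below is a stand-in: single letters or short non-alphabetic
# tokens count as "notation"; the cited work's actual detection differs.
MASK = "[MASK]"

def mask_notation(tokens, is_notation):
    """Replace tokens judged to be mathematical notation with MASK."""
    return [MASK if is_notation(t) else t for t in tokens]

def naive_is_notation(token):
    return (len(token) <= 2 and not token.isalpha()) or (len(token) == 1 and token.isalpha())

sentence = "where x denotes the learning rate and N is the batch size".split()
print(mask_notation(sentence, naive_is_notation))
# ['where', '[MASK]', 'denotes', 'the', 'learning', 'rate', 'and', '[MASK]', 'is', 'the', 'batch', 'size']
```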

Understanding Jargon: Combining Extraction and Generation for Definition Modeling

This paper proposes to combine extraction and generation for jargon definition modeling: first extract self- and correlative definitional information of target jargon from the Web, and then generate the definitions by incorporating the extracted definitional information.

Hammer PDF: An Intelligent PDF Reader for Scientific Papers

The proposed Hammer PDF Reader can help researchers, especially those studying computer science, improve the efficiency and experience of reading scientific papers.

Utilizing Text Structure for Information Extraction

This survey reviews the structure-based deep models proposed for various IE tasks and other related NLP tasks, along with the limitations of the existing models and the potential for future work.

AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21

This paper presents an Adversarial Training BERT method named AT-BERT, the winning solution to the acronym identification task of the Scientific Document Understanding (SDU) Challenge at AAAI 2021; it incorporates the FGM adversarial training strategy into the fine-tuning of BERT, making the model more robust and better at generalizing.
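
FGM perturbs the embedding layer in the direction of the loss gradient before a second backward pass. The following is a minimal plain-PyTorch sketch of that step; the helper names and the toy usage are illustrative assumptions, not the cited system's code.

```python
# Minimal sketch of FGM (Fast Gradient Method) adversarial training applied
# to an embedding layer. Helper names and the toy usage are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgm_training_step(embedding, compute_loss, batch, optimizer, epsilon=1.0):
    optimizer.zero_grad()
    loss = compute_loss(batch)
    loss.backward()                                   # gradients on the clean input

    grad = embedding.weight.grad
    backup = embedding.weight.data.clone()            # save original embeddings
    norm = grad.norm()
    if norm > 0:
        # FGM step: perturb embeddings along the loss gradient direction.
        embedding.weight.data.add_(epsilon * grad / norm)

    adv_loss = compute_loss(batch)                    # loss on the perturbed input
    adv_loss.backward()                               # accumulate adversarial gradients

    embedding.weight.data.copy_(backup)               # restore original embeddings
    optimizer.step()                                  # update with combined gradients
    return loss.item(), adv_loss.item()

# Toy usage: a tiny bag-of-embeddings classifier over random token ids.
emb = nn.Embedding(100, 16)
head = nn.Linear(16, 2)

def compute_loss(batch):
    tokens, labels = batch
    return F.cross_entropy(head(emb(tokens).mean(dim=1)), labels)

opt = torch.optim.SGD(list(emb.parameters()) + list(head.parameters()), lr=0.1)
batch = (torch.randint(0, 100, (4, 7)), torch.randint(0, 2, (4,)))
print(fgm_training_step(emb, compute_loss, batch, opt))
```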

References

Showing 1-10 of 33 references

Automated Discovery of Mathematical Definitions in Text

This paper investigates the automatic detection of one-sentence definitions in mathematical texts, which are difficult to separate from the surrounding text, and applies deep learning methods such as convolutional and recurrent neural networks to identify mathematical definitions.

A Joint Model for Definition Extraction with Syntactic Connection and Semantic Consistency

This work proposes a novel model for DE that simultaneously performs the two tasks in a single framework to benefit from their interdependencies, presenting a multi-task learning framework that employs graph convolutional neural networks and predicts the dependency paths between the terms and the definitions.

Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

This work presents DefMiner, a supervised sequence labeling system that identifies scientific terms and their accompanying definitions, achieving 85% F1 on a Wikipedia benchmark corpus and significantly improving on the previous state of the art by 8%.

Syntactically Aware Neural Architectures for Definition Extraction

This paper presents a set of neural architectures combining Convolutional and Recurrent Neural Networks, which are further enriched by incorporating linguistic information via syntactic dependencies, and demonstrates that models trained on clean Wikipedia-like definitions can successfully be applied to more noisy domain-specific corpora.

Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols

This work introduces ScholarPhi, an augmented reading interface with four novel features: tooltips that surface position-sensitive definitions from elsewhere in a paper, a filter over the paper that “declutters” it to reveal how the term or symbol is used across the paper, automatic equation diagrams that expose multiple definitions in parallel, and an automatically generated glossary of important terms and symbols.

Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation

MASI, a distance metric for comparing sets, is discussed, and its use in quantifying the reliability of a specific dataset is illustrated; it is argued that a paradigmatic reliability study should relate measures of inter-annotator agreement to independent assessments, such as significance tests of the annotated variables with respect to other phenomena.
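
For orientation, MASI is commonly described as the Jaccard coefficient scaled by a monotonicity weight, with the MASI distance being one minus that score. The sketch below follows that common formulation and is only an approximation; consult the cited paper for the authoritative definition.

```python
# Sketch of the set-agreement score underlying MASI (Jaccard similarity times
# a monotonicity weight); the MASI distance is commonly 1 minus this value.
def masi(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                       # two empty sets agree completely
    jaccard = len(a & b) / len(a | b)
    if a == b:
        m = 1.0                          # identical sets
    elif a <= b or b <= a:
        m = 2 / 3                        # one set subsumes the other
    elif a & b:
        m = 1 / 3                        # overlapping, neither subsumes the other
    else:
        m = 0.0                          # disjoint sets
    return jaccard * m

print(masi({"term", "definition"}, {"term", "definition"}))  # 1.0
print(masi({"term"}, {"term", "definition"}))                # 0.5 * 2/3 ≈ 0.333
print(masi({"term"}, {"definition"}))                        # 0.0
```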

Definition Extraction with LSTM Recurrent Neural Networks

This work models definition extraction as a supervised sequence classification task and proposes a new way to automatically generate sentence features using a Long Short-Term Memory neural network model, outperforming the current state-of-the-art methods by 5.8%.
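
As a minimal sketch of the general idea, an LSTM can encode a sentence and its final hidden state can feed a binary definition/non-definition classifier; the dimensions below are hypothetical and this makes no claim to match the cited architecture.

```python
# Minimal sketch of LSTM-based definition sentence classification
# (hypothetical dimensions; not the cited paper's exact architecture).
import torch
import torch.nn as nn

class DefinitionSentenceClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classify = nn.Linear(hidden_dim, 2)     # definition vs. non-definition

    def forward(self, token_ids):
        x = self.embed(token_ids)                    # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(x)                   # final hidden state
        return self.classify(h_n[-1])                # (batch, 2) logits

model = DefinitionSentenceClassifier()
logits = model(torch.randint(0, 10000, (3, 12)))     # 3 sentences of 12 tokens
print(logits.shape)                                  # torch.Size([3, 2])
```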

Learning Word-Class Lattices for Definition and Hypernym Extraction

This paper proposes Word-Class Lattices, a generalization of word lattices that is applied to the task of definition and hypernym extraction and compares favorably to other pattern generalization methods proposed in the literature.

Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern bootstrapping and deep analysis

This comparative study of two approaches to extracting definitional sentences from a corpus of scholarly discourse, one based on bootstrapping lexico-syntactic patterns and the other on deep analysis, shows that both methods extract high-quality definition sentences intended for automated glossary construction.
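
As a toy illustration of the pattern-based side only, a few hand-written lexico-syntactic patterns can already flag candidate definition sentences; these example patterns are my own, not the cited system's bootstrapped pattern set.

```python
# Toy illustration of lexico-syntactic definition patterns (hand-written
# examples, not the cited system's bootstrapped pattern set).
import re

DEFINITION_PATTERNS = [
    re.compile(r"\b(?P<term>[A-Z][\w-]*(?:\s+[\w-]+){0,3})\s+is\s+defined\s+as\s+(?P<definition>.+)", re.I),
    re.compile(r"\b(?P<term>[A-Z][\w-]*(?:\s+[\w-]+){0,3})\s+refers\s+to\s+(?P<definition>.+)", re.I),
    re.compile(r"\bwe\s+call\s+(?P<definition>.+?)\s+(?:a|an|the)\s+(?P<term>[\w-]+)", re.I),
]

def match_definition(sentence):
    """Return (term, definition) for the first matching pattern, else None."""
    for pattern in DEFINITION_PATTERNS:
        m = pattern.search(sentence)
        if m:
            return m.group("term").strip(), m.group("definition").rstrip(". ").strip()
    return None

print(match_definition("Perplexity is defined as the exponentiated average negative log-likelihood."))
# ('Perplexity', 'the exponentiated average negative log-likelihood')
```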

S2ORC: The Semantic Scholar Open Research Corpus

S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines, is introduced; it is expected to facilitate research and development of tools and tasks for text mining over academic text.