SciREX: A Challenge Dataset for Document-Level Information Extraction

@inproceedings{Jain2020SciREXAC,
  title={SciREX: A Challenge Dataset for Document-Level Information Extraction},
  author={Sarthak Jain and Madeleine van Zuylen and Hannaneh Hajishirzi and Iz Beltagy},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
Extracting information from full documents is an important problem in many domains, but most previous work focuses on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level, since it requires an understanding of the whole document to annotate entities and their document-level relationships, which usually span beyond sentences or even sections. In this paper, we introduce SciREX, a document level… 


Document-level Entity-based Extraction as Template Generation

A generative framework is proposed for two document-level EE tasks, role-filler entity extraction (REE) and relation extraction (RE), allowing models to efficiently capture cross-entity dependencies, exploit label semantics, and avoid the exponential computational complexity of identifying N-ary relations.

Seq2rel: A sequence-to-sequence-based approach for document-level relation extraction

  • 2021
This paper develops a sequence-to-sequence-based approach that can learn the sub-tasks of document-level RE (entity extraction, coreference resolution, and relation extraction) in an end-to-end fashion, and demonstrates that, under this model, the end-to-end approach outperforms pipeline-based approaches.

Joint Detection and Coreference Resolution of Entities and Events with Document-level Context Aggregation

By extending the jointly trained model to the document level, this work improves results by incorporating cross-sentence dependencies and additional contextual information that might not be available at the sentence level, allowing for more globally optimized predictions.

A sequence-to-sequence approach for document-level relation extraction

This paper develops a sequence-to-sequence approach, seq2rel, that can learn the subtasks of DocRE end-to-end, replacing a pipeline of task-specific components, and demonstrates that, under this model, an end-to-end approach outperforms a pipeline-based approach.
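The seq2seq formulation above depends on linearizing relations (including coreferent mention clusters) into a flat target string the decoder can emit. The sketch below illustrates one such round-trippable linearization; the special tokens (`@HEAD@`, `@TAIL@`, `@END@`) and format are illustrative assumptions, not seq2rel's exact schema.

```python
# Toy linearization of document-level relations into a seq2seq target string,
# in the spirit of seq2rel. Coreferent mentions of an entity are joined with
# "; " so the decoder emits whole clusters, folding coreference into generation.

def linearize(relations):
    """relations: list of (head_mentions, tail_mentions, rel_type) triples.
    Returns a single target string for a seq2seq decoder."""
    parts = []
    for heads, tails, rel_type in relations:
        parts.append(
            f"{'; '.join(heads)} @HEAD@ {'; '.join(tails)} @TAIL@ {rel_type} @END@"
        )
    return " ".join(parts)

def parse(target):
    """Inverse of linearize: recover triples from a generated target string."""
    triples = []
    for chunk in target.split(" @END@"):
        chunk = chunk.strip()
        if not chunk:
            continue
        head_part, rest = chunk.split(" @HEAD@ ")
        tail_part, rel_type = rest.split(" @TAIL@ ")
        triples.append((head_part.split("; "), tail_part.split("; "), rel_type))
    return triples
```

Because `parse` inverts `linearize`, generated strings can be decoded back into triples for evaluation against gold relations.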

Automatic Error Analysis for Document-level Information Extraction

This work builds on the work of Kummerfeld and Klein (2013) to propose a transformation-based framework for automating error analysis in document-level event and (N-ary) relation extraction, and compares two state-of-the-art document-level template-filling approaches on datasets from three domains.

ArgFuse: A Weakly-Supervised Framework for Document-Level Event Argument Aggregation

An extractive algorithm with multiple sieves that adopts active learning strategies to work efficiently in low-resource settings; this work is the first to establish baseline results for this task in English.

Extraction of Competing Models using Distant Supervision and Graph Ranking

The task of detecting competing model entities in scientific documents is introduced; it will serve as an important starting point for mapping the research landscape of computer science in a scalable manner, with minimal human intervention.

Efficient End-to-end Learning of Cross-event Dependencies for Document-level Event Extraction

This paper proposes an end-to-end model leveraging Deep Value Networks (DVN), a structured prediction algorithm, to efficiently capture cross-event dependencies for document-level event extraction; it achieves performance comparable to a CRF-based model on ACE05 while enjoying significantly higher efficiency.

Few-Shot Document-Level Relation Extraction

This work adapts the state-of-the-art sentence-level method MNAV to the document level and develops it further for improved domain adaptation, finding FSDLRE to be a challenging setting with interesting new characteristics, such as the ability to sample NOTA instances from the support set.

ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select

The proposed method, ReSel, decomposes this task into a two-stage procedure that first retrieves the most relevant paragraph or table and then selects the target entity from the retrieved component.
...

References


DocRED: A Large-Scale Document-Level Relation Extraction Dataset

Empirical results show that DocRED is challenging for existing RE methods, which indicates that document-level RE remains an open problem and requires further efforts.

Document-Level N-ary Relation Extraction with Multiscale Representation Learning

This paper proposes a novel multiscale neural architecture for document-level n-ary relation extraction that combines representations learned over various text spans throughout the document and across the subrelation hierarchy.

Entity, Relation, and Event Extraction with Contextualized Span Representations

This work examines the capabilities of DyGIE++, a unified multi-task framework for three information extraction tasks (named entity recognition, relation extraction, and event extraction), and achieves state-of-the-art results across all tasks.

Modeling Relations and Their Mentions without Labeled Text

A novel approach to distant supervision that alleviates the problem of noisy patterns hurting precision, using a factor graph and constraint-driven semi-supervision to train the model without any knowledge of which sentences express the relations in the training KB.

Supervised Open Information Extraction

A novel formulation of Open IE as a sequence tagging problem, addressing challenges such as encoding multiple extractions for a predicate, and a supervised model that outperforms the existing state-of-the-art Open IE systems on benchmark datasets.

SciBERT: A Pretrained Language Model for Scientific Text

SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.

Dependency-Guided LSTM-CRF for Named Entity Recognition

This work proposes a simple yet effective dependency-guided LSTM-CRF model to encode the complete dependency trees and capture the above properties for the task of named entity recognition (NER).

Position-aware Attention and Supervised Data Improve Slot Filling

An effective new model is proposed that combines an LSTM sequence model with a form of entity position-aware attention better suited to relation extraction; the work also builds TACRED, a large supervised relation extraction dataset obtained via crowdsourcing and targeted toward TAC KBP relations.

Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction

The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links and supports construction of a scientific knowledge graph, which is used to analyze information in scientific literature.

End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures

A novel end-to-end neural model to extract entities and the relations between them, which compares favorably to the state-of-the-art CNN-based model (in F1-score) on nominal relation classification (SemEval-2010 Task 8).