SpanBERT: Improving Pre-training by Representing and Predicting Spans

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy. Transactions of the Association for Computational Linguistics.
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as… 
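The core idea above, masking contiguous spans rather than individual tokens, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the span-length distribution here is a simple clipped geometric sampler, the `mask_ratio`, `p`, and `max_len` values are illustrative defaults, and real pipelines operate on subword IDs rather than strings.

```python
import random

def mask_spans(tokens, mask_ratio=0.15, p=0.2, max_len=10, mask_token="[MASK]"):
    """Mask contiguous spans until roughly mask_ratio of tokens are covered.

    Span lengths are drawn from a geometric-style distribution clipped at
    max_len (in the spirit of SpanBERT's span masking); start positions are
    chosen uniformly at random.
    """
    budget = max(1, int(len(tokens) * mask_ratio))
    masked = list(tokens)
    covered = set()
    while len(covered) < budget:
        # Sample a span length: continue extending with probability (1 - p).
        length = 1
        while length < max_len and random.random() > p:
            length += 1
        start = random.randrange(0, max(1, len(tokens) - length + 1))
        for i in range(start, min(start + length, len(tokens))):
            covered.add(i)
            masked[i] = mask_token
    return masked, sorted(covered)
```

The returned index list is what a span boundary objective would consume: each masked position is predicted from the representations of the tokens just outside the span, plus a position embedding.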

A Structured Span Selector

This paper proposes a novel grammar-based structured span selection model which learns to make use of the partial span-level annotation provided for such problems, and dispenses with the heuristic greedy span selection scheme, allowing the downstream task to be trained on an optimal set of spans.

Span Selection Pre-training for Question Answering

This paper introduces a new pre-training task inspired by reading comprehension that shifts pre-training from memorization toward understanding, and shows strong empirical results: the proposed model obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short-answer prediction.

Few-Shot Question Answering by Pretraining Span Selection

This work proposes a new pretraining scheme tailored for question answering: recurring span selection, where masked spans are replaced with a special token that is later used during fine-tuning to select the answer span.
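The recurring span selection idea (this is the Splinter pre-training scheme) can be illustrated with a toy preprocessor: find a span that occurs more than once in a passage, keep one occurrence as the "answer", and replace the others with a question token. This sketch is a simplified stand-in; the `[QUESTION]` token name and the fixed bigram length are assumptions for illustration, and the real method selects among many recurring spans.

```python
from collections import Counter

def mask_recurring_spans(tokens, span_len=2, mask_token="[QUESTION]"):
    """Replace all but one occurrence of a recurring span with a mask token.

    Toy version of recurring span selection: the kept occurrence serves as
    the answer span that the model later learns to select from the mask.
    """
    # Count contiguous n-grams of the given length.
    ngrams = Counter(tuple(tokens[i:i + span_len])
                     for i in range(len(tokens) - span_len + 1))
    recurring = [g for g, c in ngrams.items() if c > 1]
    if not recurring:
        return tokens, None
    target = recurring[0]
    out, i, kept = [], 0, False
    while i < len(tokens):
        if tuple(tokens[i:i + span_len]) == target:
            if kept:
                out.append(mask_token)            # mask this later occurrence
            else:
                out.extend(tokens[i:i + span_len])  # keep one occurrence as the answer
                kept = True
            i += span_len
        else:
            out.append(tokens[i])
            i += 1
    return out, target
```

During fine-tuning, the same mask token is placed where the answer should go, so the span-selection head transfers directly from pre-training to QA.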

Improving Span Representation for Domain-adapted Coreference Resolution

This work develops methods to improve the span representations via a retrofitting loss to incentivize span representations to satisfy a knowledge-based distance function and a scaffolding loss to guide the recovery of knowledge from the span representation.

Studying Strategically: Learning to Mask for Closed-book QA

This paper aims to learn the optimal masking strategy for the intermediate pre-training stage: it first trains the masking policy to extract spans that are likely to be tested, using supervision from the downstream task itself, then deploys the learned policy during intermediate pre-training.

A Cross-Task Analysis of Text Span Representations

This paper conducts a comprehensive empirical evaluation of six span representation methods using eight pretrained language representation models across six tasks, including two newly introduced tasks.

ANNA: Enhanced Language Representation for Question Answering

This paper proposes an extended pre-training task, and a new neighbor-aware mechanism that attends more to neighboring tokens in order to capture the richness of context for pre-training language modeling.

CorefQA: Coreference Resolution as Query-based Span Prediction

CorefQA is presented, an accurate and extensible approach for the coreference resolution task, formulated as a span prediction task, like in question answering, which provides the flexibility of retrieving mentions left out at the mention proposal stage.

A Simple Unsupervised Approach for Coreference Resolution using Rule-based Weak Supervision

This work transfers the linguistic knowledge encoded by Stanford's rule-based coreference system to the end-to-end model, which jointly learns rich, contextualized span representations and coreference chains in settings where labeled data is unavailable.

End-to-end Neural Coreference Resolution

This work introduces the first end-to-end coreference resolution model, which is trained to maximize the marginal likelihood of gold antecedent spans from coreference clusters, and is factored to enable aggressive pruning of potential mentions.

Learning Recurrent Span Representations for Extractive Question Answering

This paper presents a novel model architecture that efficiently builds fixed length representations of all spans in the evidence document with a recurrent network, and shows that scoring explicit span representations significantly improves performance over other approaches that factor the prediction into separate predictions about words or start and end markers.

BERT for Coreference Resolution: Baselines and Analysis

A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but distinct entities, but that there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

ERNIE: Enhanced Language Representation with Informative Entities

This paper utilizes both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE) which can take full advantage of lexical, syntactic, and knowledge information simultaneously, and is comparable with the state-of-the-art model BERT on other common NLP tasks.

Matching the Blanks: Distributional Similarity for Relation Learning

This paper builds on extensions of Harris’ distributional hypothesis to relations, as well as recent advances in learning text representations (specifically, BERT), to build task agnostic relation representations solely from entity-linked text.

Unified Language Model Pre-training for Natural Language Understanding and Generation

A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks, and that compares favorably with BERT on the GLUE benchmark and the SQuAD 2.0 and CoQA question answering tasks.

MASS: Masked Sequence to Sequence Pre-training for Language Generation

This work proposes MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks, which achieves the state-of-the-art accuracy on the unsupervised English-French translation, even beating the early attention-based supervised model.

CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes

This paper describes the OntoNotes annotation (coreference and other layers) and the parameters of the shared task, including the format, pre-processing information, and evaluation criteria, and presents and discusses the results achieved by the participating systems.

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

This work proposes a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence, which consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
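The segment-level recurrence mechanism can be sketched in miniature: each new segment attends over a cached memory of hidden states carried over from previous segments. This toy version is heavily simplified, integers stand in for hidden-state vectors, "attention over the context" is reduced to simply returning the context, and the real model stops gradients through the cached memory.

```python
def segment_recurrence(segments, mem_len=2):
    """Toy segment-level recurrence in the spirit of Transformer-XL.

    Each segment's attention context is extended with a fixed-size memory
    of states from earlier segments, so dependencies can span segment
    boundaries without reprocessing the whole history.
    """
    memory = []
    outputs = []
    for seg in segments:
        context = memory + seg           # memory extends the attention context
        outputs.append(list(context))    # stand-in for attention over context
        memory = (memory + seg)[-mem_len:]  # cache only the newest mem_len states
    return outputs
```

With a fixed `mem_len`, compute per segment stays constant while the effective context grows across segments, which is the mechanism's main appeal over a fixed-length Transformer.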