• Corpus ID: 230437704

Cross-Document Language Modeling

Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, Ido Dagan

We introduce a new pretraining approach for language models that are geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several…

LinkBERT: Pretraining Language Models with Document Links

This work proposes LinkBERT, an LM pretraining method that leverages links between documents and outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links).

LinkBERT: Language Model Pretraining with Document Link Knowledge

LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks, outperforms BERT on diverse downstream tasks across two domains: a general domain (pretrained on Wikipedia with hyperlinks) and a biomedical domain (pretrained on PubMed with citation links).

PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization

A pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of labeled data, and outperforms current state-of-the-art models in most settings by large margins.

Sequential Cross-Document Coreference Resolution

A new model is proposed that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings and achieves competitive results for both entity and event coreference while providing strong evidence of the efficacy of both sequential models and higher-order inference in cross-document settings.

SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts

SciCo, an expert-annotated dataset for H-CDCR in scientific papers, is created; it is 3X larger than the prominent ECB+ resource. Strong baseline models customized for H-CDCR are studied, and challenges for future work are highlighted.

Cross-document Coreference Resolution over Predicted Mentions

This work introduces the first end-to-end model for CD coreference resolution from raw text, which extends the prominent model for within-document coreference to the CD setting and achieves competitive results for event and entity coreference resolution on gold mentions.

Focus on what matters: Applying Discourse Coherence Theory to Cross Document Coreference

This work models the entities/events in a reader’s focus as a neighborhood within a learned latent embedding space which minimizes the distance between mentions and the centroids of their gold coreference clusters, leading to a robust coreference resolution model that is now feasible to apply to downstream tasks.

Structure-inducing pre-training

A pre-training framework is introduced that enables a granular and comprehensive understanding of how relational structure can be induced and establishes a connection between the relational inductive bias of pre-training and fine-tuning performance.

Event Coreference Resolution based on Convolutional Siamese network and Circle Loss

This paper proposes a novel model that focuses on event classes with low event semantic similarity, building a Siamese network framework to enhance the feature representation; it achieves better results than other models that have not been fine-tuned on more datasets or language models.

XCoref: Cross-document Coreference Resolution in the Wild

Outperforming an established CDCR model shows that new CDCR models need to be evaluated on semantically complex mentions with looser coreference relations to indicate their applicability to resolving mentions in the “wild” of political news articles.

Multilevel Text Alignment with Cross-Document Attention

This work proposes a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component, enabling structural comparisons across different levels (document-to-document and sentence-to-document).

Pre-training via Paraphrasing

It is shown that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.

Revisiting Joint Modeling of Cross-document Entity and Event Coreference Resolution

This work jointly models entity and event coreference, and proposes a neural architecture for cross-document coreference resolution that represents a mention using its lexical span, surrounding context, and relation to entity (event) mentions via predicate-argument structures.

Longformer: The Long-Document Transformer

Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling and achieves state-of-the-art results on text8 and enwik8; it is also pretrained and finetuned on a variety of downstream tasks.

Semantic Text Matching for Long-Form Documents

This paper proposes a novel Siamese multi-depth attention-based hierarchical recurrent neural network (SMASH RNN) that learns long-form semantics and enables long-form document-based semantic text matching.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Streamlining Cross-Document Coreference Resolution: Evaluation and Modeling

This work proposes a pragmatic evaluation methodology which assumes access to only raw text rather than gold mentions, disregards singleton prediction, and addresses typical targeted settings in CD coreference resolution.

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores.

Hierarchical Document Encoder for Parallel Corpus Mining

The results show document embeddings derived from sentence-level averaging are surprisingly effective for clean datasets, but suggest models trained hierarchically at the document-level are more effective on noisy data.

Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

The proposed SMITH model outperforms the previous state-of-the-art models, including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT, and is able to increase the maximum input text length from 512 to 2048 tokens.