Cross-Document Language Modeling
@article{Caciularu2021CrossDocumentLM,
  title   = {Cross-Document Language Modeling},
  author  = {Avi Caciularu and Arman Cohan and Iz Beltagy and Matthew E. Peters and Arie Cattan and Ido Dagan},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2101.00406}
}
We introduce a new pretraining approach for language models that are geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several…
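The abstract describes two mechanisms: packing several related documents into one long input, and masking tokens so that predicting them encourages the model to use cross-document context. The snippet below is a minimal illustrative sketch of that setup, assuming the Hugging Face Longformer masked-LM checkpoint, an ad-hoc </doc-s> separator string, uniform 15% masking, and global attention on masked positions; the paper's actual special tokens, masking policy, and attention configuration may differ.

```python
# Sketch of cross-document masked language modeling with Longformer.
# Assumptions (not taken from the page above): the Hugging Face
# "allenai/longformer-base-4096" checkpoint, a plain </doc-s> separator,
# and uniform 15% token masking.
import torch
from transformers import LongformerTokenizerFast, LongformerForMaskedLM

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Several related documents are packed into a single long input so that a
# masked token in one document can be predicted from the other documents.
related_docs = [
    "First article describing an event ...",
    "Second article covering the same event from another source ...",
]
text = " </doc-s> ".join(related_docs)

enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

# Randomly mask 15% of non-special tokens; the loss is computed only on
# the masked positions (label -100 is ignored by the MLM loss).
special = torch.tensor(
    tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True
    ),
    dtype=torch.bool,
).unsqueeze(0)
mask = (torch.rand(input_ids.shape) < 0.15) & ~special
input_ids[mask] = tokenizer.mask_token_id
labels[~mask] = -100

# Longformer combines local windowed attention with global attention;
# here global attention is placed on the masked tokens so that they can
# attend across the whole multi-document input.
global_attention_mask = mask.long()

out = model(
    input_ids=input_ids,
    attention_mask=enc["attention_mask"],
    global_attention_mask=global_attention_mask,
    labels=labels,
)
print(float(out.loss))
```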
15 Citations
LinkBERT: Pretraining Language Models with Document Links
- 2022
Computer Science
ACL
This work proposes LinkBERT, an LM pretraining method that leverages links between documents and outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links).
LinkBERT: Language Model Pretraining with Document Link Knowledge
- 2022
Computer Science
LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks, outperforms BERT on diverse downstream tasks across two domains: a general domain (pretrained on Wikipedia with hyperlinks) and a biomedical domain (pretrained on PubMed with citation links).
PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
- 2021
Computer Science
ArXiv
A pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of labeled data, and outperforms current state-of-the-art models in most evaluated settings by large margins.
Sequential Cross-Document Coreference Resolution
- 2021
Computer Science
EMNLP
A new model is proposed that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings and achieves competitive results for both entity and event coreference while providing strong evidence of the efficacy of both sequential models and higher-order inference in cross-document settings.
SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts
- 2021
Computer Science
AKBC
SciCo, an expert-annotated dataset for H-CDCR in scientific papers, is created, 3X larger than the prominent ECB+ resource; strong baseline models customized for H-CDCR are studied, and challenges for future work are highlighted.
Cross-document Coreference Resolution over Predicted Mentions
- 2021
Computer Science
Findings of ACL
This work introduces the first end-to-end model for CD coreference resolution from raw text, which extends the prominent model for within-document coreference to the CD setting and achieves competitive results for event and entity coreference resolution on gold mentions.
Focus on what matters: Applying Discourse Coherence Theory to Cross Document Coreference
- 2021
Computer Science
EMNLP
This work models the entities/events in a reader’s focus as a neighborhood within a learned latent embedding space that minimizes the distance between mentions and the centroids of their gold coreference clusters, leading to a robust coreference resolution model that is now feasible to apply to downstream tasks.
Structure-inducing pre-training
- 2023
Computer Science
Nature Machine Intelligence
A pre-training framework is introduced that enables a granular and comprehensive understanding of how relational structure can be induced and establishes a connection between the relational inductive bias of pre-training and fine-tuning performance.
Event Coreference Resolution based on Convolutional Siamese network and Circle Loss
- 2022
Computer Science
2022 International Joint Conference on Neural Networks (IJCNN)
This paper proposes a novel model that focuses on event classes with low event semantic similarity, building a Siamese network framework to enhance the feature representation; it achieves better results than other models that have not been fine-tuned on additional datasets or language models.
XCoref: Cross-document Coreference Resolution in the Wild
- 2022
Computer Science
iConference
Outperforming an established CDCR model shows that new CDCR models need to be evaluated on semantically complex mentions with looser coreference relations to indicate their applicability to resolving mentions in the “wild” of political news articles.
43 References
Multilevel Text Alignment with Cross-Document Attention
- 2020
Computer Science
EMNLP
This work proposes a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component, enabling structural comparisons across different levels (document-to-document and sentence-to-document).
Pre-training via Paraphrasing
- 2020
Computer Science
NeurIPS
It is shown that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date.
Revisiting Joint Modeling of Cross-document Entity and Event Coreference Resolution
- 2019
Computer Science
ACL
This work jointly models entity and event coreference, and proposes a neural architecture for cross-document coreference resolution that represents each mention using its lexical span, surrounding context, and relation to entity (event) mentions via predicate-argument structures.
Longformer: The Long-Document Transformer
- 2020
Computer Science
ArXiv
Following prior work on long-sequence transformers, Longformer is evaluated on character-level language modeling and achieves state-of-the-art results on text8 and enwik8; Longformer is also pretrained and finetuned on a variety of downstream tasks.
Semantic Text Matching for Long-Form Documents
- 2019
Computer Science
WWW
This paper proposes a novel Siamese multi-depth attention-based hierarchical recurrent neural network (SMASH RNN) that learns long-form semantics and enables long-form document-based semantic text matching.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
- 2018
Computer Science
BlackboxNLP@EMNLP
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models are presented; the benchmark favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.
Streamlining Cross-Document Coreference Resolution: Evaluation and Modeling
- 2020
Computer Science
ArXiv
This work proposes a pragmatic evaluation methodology which assumes access only to raw text, rather than gold mentions, disregards singleton prediction, and addresses typical targeted settings in CD coreference resolution.
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
- 2020
Computer Science
ICML
This work proposes pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective, PEGASUS, and demonstrates that it achieves state-of-the-art performance on all 12 downstream datasets as measured by ROUGE scores.
Hierarchical Document Encoder for Parallel Corpus Mining
- 2019
Computer Science
WMT
The results show that document embeddings derived from sentence-level averaging are surprisingly effective for clean datasets, but suggest that models trained hierarchically at the document level are more effective on noisy data.
Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching
- 2020
Computer Science
CIKM
The proposed SMITH model outperforms previous state-of-the-art models, including hierarchical attention, the multi-depth attention-based hierarchical recurrent neural network, and BERT, and is able to increase the maximum input text length from 512 to 2048 tokens.