LinkBERT: Pretraining Language Models with Document Links

Michihiro Yasunaga, Jure Leskovec, Percy Liang
Language model (LM) pretraining captures various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then… 
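The idea of treating a corpus as a graph of documents and placing linked documents in the same context can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_linked_inputs`, the toy `docs` and `links` dictionaries, and the literal `[CLS]`/`[SEP]` marker strings are all assumptions made for illustration, and the sketch covers only the input-construction step described above.

```python
from typing import Dict, List, Tuple


def build_linked_inputs(
    docs: Dict[str, str],
    links: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Pair each document with every document it links to.

    Returns (source_id, target_id, combined_text) triples. The
    "[CLS] ... [SEP] ... [SEP]" layout is an assumed BERT-style
    convention, not something specified in the abstract.
    """
    inputs = []
    for src, targets in links.items():
        if src not in docs:
            continue  # skip links from documents we have no text for
        for tgt in targets:
            if tgt in docs and tgt != src:
                combined = f"[CLS] {docs[src]} [SEP] {docs[tgt]} [SEP]"
                inputs.append((src, tgt, combined))
    return inputs


# Hypothetical toy corpus: three documents and their outgoing links.
docs = {"a": "Doc A text.", "b": "Doc B text.", "c": "Doc C text."}
links = {"a": ["b"], "b": ["c", "a"], "c": []}

linked_inputs = build_linked_inputs(docs, links)
for src, tgt, text in linked_inputs:
    print(src, "->", tgt, ":", text)
```

In practice a real tokenizer would insert the special tokens and truncate each segment to fit the model's context length; the plain strings here only make the pairing visible.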

Deep Bidirectional Language-Knowledge Graph Pretraining

DRAGON is proposed, a self-supervised method to pretrain a deeply joint language-knowledge foundation model from text and KG at scale; it achieves strong performance on complex reasoning about language and knowledge and on low-resource QA, and new state-of-the-art results on various BioNLP tasks.

Structure Inducing Pre-Training

A descriptive framework for pre-training is introduced that illustrates how relational structure can be induced; theoretical and empirical analyses show that this approach can offer meaningful improvements over existing methods across various domains and tasks.

SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

It is shown that state-of-the-art models struggle to generalize across task formats and that simple multi-task training fails to improve them, whereas a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.

BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining

This paper proposes BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature and evaluates it on six biomedical natural language processing tasks and demonstrates that the model outperforms previous models on most tasks.

Learning on Large-scale Text-attributed Graphs via Variational Inference

This paper proposes an efficient and effective solution to learning on large text-attributed graphs by fusing graph structure and language learning with a variational Expectation-Maximization (EM) framework, called GLEM.

Enriching Biomedical Knowledge for Low-resource Language Through Translation

A state-of-the-art English-Vietnamese translation model is used to translate and produce both pretraining and supervised data in the biomedical domain, and ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus, is introduced.

Optimizing Bi-Encoder for Named Entity Recognition via Contrastive Learning

We present an efficient bi-encoder framework for named entity recognition (NER), which applies contrastive learning to map candidate text spans and entity types into the same vector representation space.

Context aware Named Entity Recognition and Relation Extraction with Domain-specific language model

Context-aware NER and RE models based on a domain-specific language model are developed and achieve state-of-the-art performance in ChEMU 2022: the public exact-match score is 96.33 on task 1a and 92.82 on task 1b.

BioNLI: Generating a Biomedical NLI Dataset Using Lexico-semantic Constraints for Adversarial Examples

A novel semi-supervised procedure is introduced that bootstraps an NLI dataset from an existing biomedical dataset that pairs mechanisms with experimental evidence in abstracts; it is used to create a novel dataset for NLI in the biomedical domain, called BioNLI.

ParTNER: Paragraph Tuning for Named Entity Recognition on Clinical Cases in Spanish using mBERT + Rules

This work presents a transfer learning approach starting from multilingual BERT to tackle the problem of Spanish NER (species) and normalization in clinical cases by using sentence tokenization for training and a paragraph tuning strategy at the inference phase.



Cross-Document Language Modeling

The cross-document language model (CD-LM) improves masked language modeling for multi-document NLP tasks with two key ideas, including pretraining with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships.

SPECTER: Document-level Representation Learning using Citation-informed Transformers

This work proposes SPECTER, a new method to generate document-level embeddings of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness, the citation graph, and shows that SPECTER outperforms a variety of competitive baselines on the benchmark.

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

It is shown that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models.

REALM: Retrieval-Augmented Language Model Pre-Training

The effectiveness of Retrieval-Augmented Language Model pre-training (REALM) is demonstrated by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA) and is found to outperform all previous methods by a significant margin, while also providing qualitative benefits such as interpretability and modularity.

Language Models as Knowledge Bases?

An in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models finds that BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge.

HTLM: Hyper-Text Pre-Training and Prompting of Language Models

It is shown that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels, and that HTLM is highly effective at autoprompting itself.

KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation

A unified model for Knowledge Embedding and Pre-trained Language Representation (KEPLER) is proposed, which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with strong PLMs.

ERNIE: Enhanced Language Representation with Informative Entities

This paper utilizes both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE) which can take full advantage of lexical, syntactic, and knowledge information simultaneously, and is comparable with the state-of-the-art model BERT on other common NLP tasks.

CoLAKE: Contextualized Language and Knowledge Embedding

The Contextualized Language and Knowledge Embedding (CoLAKE) is proposed, which jointly learns contextualized representation for both language and knowledge with the extended MLM objective, and achieves surprisingly high performance on a synthetic task called word-knowledge graph completion, which shows the superiority of simultaneously contextualizing language and knowledge representation.

SciBERT: A Pretrained Language Model for Scientific Text

SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks and demonstrates statistically significant improvements over BERT.