BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

@inproceedings{Schick2020BERTRAMIW,
  title={BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance},
  author={Timo Schick and Hinrich Sch{\"u}tze},
  booktitle={ACL},
  year={2020}
}
Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare… 
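
The abstract is truncated above, but the core recipe it describes (inferring an input-space embedding for a rare word from its observed contexts with a BERT-based encoder) can be sketched roughly as follows. This is an illustrative approximation, not the paper's released architecture; the example word, contexts, and mean-pooling choice are assumptions made for the sketch.

```python
# Illustrative sketch (not the official BERTRAM implementation): infer a vector
# for a rare word by encoding a few contexts that mention it with BERT and
# mean-pooling the hidden states at the word's subword positions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def context_vector(word, contexts):
    """Average BERT representations of `word` across the given contexts."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    vecs = []
    for ctx in contexts:
        enc = tokenizer(ctx, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(word_ids) + 1):           # find the word's span
            if ids[i:i + len(word_ids)] == word_ids:
                vecs.append(hidden[i:i + len(word_ids)].mean(dim=0))
                break
    return torch.stack(vecs).mean(dim=0) if vecs else None

contexts = ["The kumquat is a small citrus fruit.",
            "She peeled the kumquat before eating it."]
emb = context_vector("kumquat", contexts)
print(emb.shape)   # torch.Size([768]) -- a candidate embedding for the rare word
```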

Citations

Grounded Compositional Outputs for Adaptive Language Modeling
TLDR
This work proposes a fully compositional output embedding layer for language models, further grounded in information from a structured lexicon (WordNet), namely semantically related words and free-text definitions; the result is the first word-level language model whose size does not depend on the training vocabulary.
Improving Low Compute Language Modeling with In-Domain Embedding Initialisation
TLDR
It is shown that for the target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains.
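
A minimal sketch of the recipe this TLDR describes, assuming gensim word2vec as the in-domain embedding method and a PyTorch embedding layer; the tiny corpus and dimensions are placeholders.

```python
# Sketch: initialise a model's input embeddings from word2vec vectors trained
# on in-domain text, then freeze them (illustrative, not the paper's exact setup).
import torch
import torch.nn as nn
from gensim.models import Word2Vec

in_domain_sentences = [["the", "patient", "was", "discharged"],
                       ["the", "dosage", "was", "adjusted", "daily"]]
w2v = Word2Vec(sentences=in_domain_sentences, vector_size=100, min_count=1)

vocab = w2v.wv.index_to_key                       # token -> row index mapping
weights = torch.tensor(w2v.wv.vectors)            # (len(vocab), 100)

embedding = nn.Embedding.from_pretrained(weights, freeze=True)  # frozen inputs
# The rest of the language model is trained as usual; rare in-domain words keep
# the representation learned from in-domain data.
```
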
Overcoming Poor Word Embeddings with Word Definitions
TLDR
This work shows that examples that depend critically on a rarer word are more challenging for natural language inference models, and explores how a model could learn to use definitions, provided in natural text, to overcome this handicap.
Lacking the Embedding of a Word? Look it up into a Traditional Dictionary
TLDR
Two methods are introduced: Definition Neural Network (DefiNNet) and Define BERT (DefBERT), which significantly outperform state-of-the-art as well as baseline methods devised for producing embeddings of unknown words.
E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT
TLDR
A novel way of injecting factual knowledge about entities into the pretrained BERT model by aligning Wikipedia2Vec entity vectors with BERT’s native wordpiece vector space and using the aligned entity vectors as if they were wordpiece vectors.
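
The alignment step described here can be illustrated with an ordinary least-squares linear map fitted on vocabulary shared by both spaces; the snippet below uses random stand-in matrices and is a schematic of the idea, not E-BERT's released code.

```python
# Schematic of the alignment idea: fit a linear map W so that
# wikipedia2vec_vector @ W ~ bert_wordpiece_vector for words present in both
# spaces, then apply W to entity vectors. Random matrices stand in for real data.
import numpy as np

d_wiki, d_bert, n_shared = 100, 768, 5000
rng = np.random.default_rng(0)

X = rng.normal(size=(n_shared, d_wiki))     # Wikipedia2Vec vectors of shared words
Y = rng.normal(size=(n_shared, d_bert))     # BERT wordpiece vectors of the same words

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares solution of X @ W ~ Y

entity_vec = rng.normal(size=(d_wiki,))     # a Wikipedia2Vec entity vector
aligned = entity_vec @ W                    # now usable as if it were a wordpiece vector
print(aligned.shape)                        # (768,)
```
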
Too Much in Common: Shifting of Embeddings in Transformer Language Models and its Implications
TLDR
It is shown, contrary to previous studies, that the representations do not occupy a narrow cone but rather drift in common directions, and that isotropy can be restored using a simple transformation.
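
The TLDR does not spell out the transformation; one common choice in this line of work is mean-centering plus removal of a few dominant principal components, sketched below purely as an illustration (it may differ from the paper's exact procedure).

```python
# Illustration of one standard isotropy fix: subtract the mean and project out
# the top principal components (the "common directions" embeddings drift along).
import numpy as np

def remove_common_directions(embeddings, k=3):
    """embeddings: (n, d) array; returns a more isotropic copy."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                                   # top-k principal directions, (k, d)
    return centered - centered @ top.T @ top       # project them out

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64)) * 0.1 + rng.normal(size=(1, 64))  # shared offset
iso = remove_common_directions(emb)
print(np.linalg.norm(iso.mean(axis=0)))            # ~0 after the transformation
```
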
Benchmarking Meta-embeddings: What Works and What Does Not
TLDR
This paper presents a new method to generate meta-embeddings, outperforming previous work on a large number of intrinsic evaluation benchmarks, and concludes that previous extrinsic evaluations of meta-embeddings have been overestimated.
Low-Resource Adaptation of Neural NLP Models
TLDR
This thesis develops and adapts neural NLP models to explore a number of research questions concerning NLP tasks with minimal or no training data, and investigates methods for dealing with low-resource scenarios in information extraction and natural language understanding.
Learning Embeddings for Rare Words Leveraging Internet Search Engine and Spatial Location Relationships
TLDR
An algorithm is proposed to learn embeddings for rare words based on an Internet search engine and spatial location relationships, and it can learn more accurate representations for a wider range of vocabulary.
Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey
TLDR
A survey of recent work that uses large, pre-trained transformer-based language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches.
...
...

References

SHOWING 1-10 OF 44 REFERENCES
Rare Words: A Major Problem for Contextualized Embeddings And How to Fix it by Attentive Mimicking
TLDR
This work adapts Attentive Mimicking, a method that was designed to explicitly learn embeddings for rare words, to deep language models and introduces one-token approximation, a procedure that enables the method to be used even when the underlying language model uses subword-based tokenization.
Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts
TLDR
Attentive mimicking is introduced: the mimicking model is given access not only to a word’s surface form, but also to all available contexts and learns to attend to the most informative and reliable contexts for computing an embedding.
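
A rough sketch of the form-context idea behind attentive mimicking: an attention-weighted average of per-context vectors is gated against a surface-form vector. Dimensions, the scoring function, and the gate below are simplified assumptions, not the paper's exact model.

```python
# Simplified form-context sketch: attend over per-context vectors, then gate the
# result against a surface-form vector (e.g. built from character n-grams).
import torch
import torch.nn as nn

class FormContextSketch(nn.Module):
    def __init__(self, dim=300):
        super().__init__()
        self.score = nn.Linear(dim, 1)     # how informative/reliable is each context?
        self.gate = nn.Linear(2 * dim, 1)  # balance between form and context

    def forward(self, form_vec, context_vecs):
        # form_vec: (dim,); context_vecs: (n_contexts, dim)
        attn = torch.softmax(self.score(context_vecs).squeeze(-1), dim=0)
        context_vec = attn @ context_vecs                       # weighted average
        alpha = torch.sigmoid(self.gate(torch.cat([form_vec, context_vec])))
        return alpha * form_vec + (1 - alpha) * context_vec

model = FormContextSketch()
form = torch.randn(300)              # surface-form vector of the rare word
contexts = torch.randn(5, 300)       # one vector per observed context
print(model(form, contexts).shape)   # torch.Size([300])
```
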
Mimicking Word Embeddings using Subword RNNs
TLDR
MIMICK is presented, an approach to generating OOV word embeddings compositionally by learning a function from spellings to distributional embeddings, with learning performed at the type level of the original word embedding corpus.
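
A minimal sketch of the spelling-to-embedding idea, assuming a character-level BiLSTM and a mean-squared mimicking loss; layer sizes, the toy character encoding, and the pooling step are illustrative simplifications.

```python
# Sketch of the spelling-to-embedding idea: a character-level BiLSTM maps a
# word's characters to a vector that is trained to mimic the word's pretrained
# embedding, so unseen words can be embedded from their spelling alone.
import torch
import torch.nn as nn

class MimickSketch(nn.Module):
    def __init__(self, n_chars=128, char_dim=32, hidden=64, emb_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, emb_dim)

    def forward(self, char_ids):                   # (batch, word_length)
        states, _ = self.lstm(self.char_emb(char_ids))
        return self.out(states[:, -1])             # predicted word embedding

model = MimickSketch()
word = torch.tensor([[ord(c) for c in "kumquat"]])      # toy character encoding
target = torch.randn(1, 300)                            # pretrained vector to mimic
loss = nn.functional.mse_loss(model(word), target)      # mimicking objective
loss.backward()
```
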
Learning Semantic Representations for Novel Words: Leveraging Both Form and Context
TLDR
This paper proposes an architecture that leverages both sources of information - surface-form and context - and shows that it results in large increases in embedding quality, and can be integrated into any existing NLP system and enhance its capability to handle novel words.
Bad Form: Comparing Context-Based and Form-Based Few-Shot Learning in Distributional Semantic Models
TLDR
It is shown that hyperparameters that have largely been ignored in previous work can consistently improve the performance of both baseline and advanced models, achieving a new state of the art on 4 out of 6 tasks.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
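
The "one additional output layer" recipe is easy to show with the Hugging Face Transformers API; the model name, labels, and toy batch below are examples, and a real setup would add an optimizer and training loop.

```python
# Minimal sketch of fine-tuning BERT with a single added classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)   # classification head on top of [CLS]
outputs.loss.backward()                   # fine-tunes the whole stack end-to-end
```
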
Deep Contextualized Word Representations
TLDR
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.
A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors
TLDR
A la carte embedding is introduced, a simple and general alternative to the usual word2vec-based approaches for building such representations that is based upon recent theoretical results for GloVe-like embeddings.
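
The recipe can be sketched as a linear transform applied to the average of a word's context vectors, with the transform fit by least squares on words whose embeddings are already known; random arrays stand in for a real corpus and embedding set.

```python
# Sketch of the recipe: a rare word's vector is a learned linear transform A
# applied to the average of its context words' vectors; A is fit by regressing
# known word vectors onto their own context averages.
import numpy as np

d, n_words = 100, 2000
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(n_words, d))        # pretrained embeddings
context_avgs = rng.normal(size=(n_words, d))     # average context vector per word

A, *_ = np.linalg.lstsq(context_avgs, word_vecs, rcond=None)  # context_avgs @ A ~ word_vecs

rare_context_avg = rng.normal(size=(d,))          # average context vector of a rare word
rare_vec = rare_context_avg @ A                   # induced embedding
print(rare_vec.shape)                             # (100,)
```
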
context2vec: Learning Generic Context Embedding with Bidirectional LSTM
TLDR
This work presents a neural model for efficiently learning a generic context embedding function from large corpora, using bidirectional LSTM, and suggests they could be useful in a wide variety of NLP tasks.
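
A rough sketch of the idea, assuming the words left and right of a target slot are encoded by two LSTMs whose final states are combined by an MLP; sizes and the exact combination differ from the released context2vec model.

```python
# Rough sketch: encode the words left of a target slot with a forward LSTM and
# the words right of it with a backward LSTM, then combine the two states with
# an MLP to get a "context embedding" for that slot.
import torch
import torch.nn as nn

class Context2VecSketch(nn.Module):
    def __init__(self, vocab=10000, dim=100, hidden=150):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.fwd = nn.LSTM(dim, hidden, batch_first=True)
        self.bwd = nn.LSTM(dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, 2 * hidden), nn.ReLU(),
                                 nn.Linear(2 * hidden, dim))

    def forward(self, token_ids, slot):            # token_ids: (1, seq_len)
        x = self.emb(token_ids)
        left, _ = self.fwd(x[:, :slot])                      # words before the slot
        right, _ = self.bwd(x[:, slot + 1:].flip(dims=[1]))  # words after, reversed
        ctx = torch.cat([left[:, -1], right[:, -1]], dim=-1)
        return self.mlp(ctx)                        # context embedding of the slot

model = Context2VecSketch()
sentence = torch.randint(0, 10000, (1, 8))
print(model(sentence, slot=3).shape)               # torch.Size([1, 100])
```
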
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.
...
...