On the Sentence Embeddings from Pre-trained Language Models

@inproceedings{Li2020OnTS,
  title={On the Sentence Embeddings from Pre-trained Language Models},
  author={Bohan Li and Hao Zhou and Junxian He and Mingxuan Wang and Yiming Yang and Lei Li},
  booktitle={EMNLP},
  year={2020}
}
Pre-trained contextual representations like BERT have achieved great success in natural language processing. However, sentence embeddings from pre-trained language models without fine-tuning have been found to poorly capture the semantic meaning of sentences. In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited. We first reveal the theoretical connection between the masked language model pre-training objective and the semantic similarity task…
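As a rough, non-authoritative sketch of the baseline the abstract alludes to, the snippet below builds sentence embeddings by mean-pooling BERT's last-layer token states and compares two sentences with cosine similarity. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; mean pooling over the last layer is just one common convention, not a prescription from the paper.

```python
# Minimal sketch (assumptions: Hugging Face transformers, bert-base-uncased,
# mean pooling over the last layer). Illustrates the kind of off-the-shelf
# sentence embedding whose weak correlation with semantic similarity
# motivates the paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentences):
    """Average last-layer token states, ignoring padding, one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)

a, b = embed(["A man is playing a guitar.", "Someone plays an instrument."])
print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```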

Citations

Disentangling Semantics and Syntax in Sentence Embeddings with Pre-trained Language Models
TLDR
ParaBART is a semantic sentence embedding model that learns to disentangle semantics and syntax in sentence embeddings obtained by pre-trained language models, and can effectively remove syntactic information from semantic sentence embeddings, leading to better robustness against syntactic variation on downstream semantic tasks.
Improving Contextual Representation with Gloss Regularized Pre-training
TLDR
This work proposes an auxiliary gloss regularizer module for BERT pre-training (GR-BERT) to enhance word semantic similarity by predicting masked words and aligning contextual embeddings to their corresponding glosses simultaneously, so that word similarity can be explicitly modeled.
Comparison and Combination of Sentence Embeddings Derived from Different Supervision Signals
TLDR
This paper focuses on two types of sentence embedding methods with similar architectures and tasks: one fine-tunes pre-trained language models on the natural language inference task, and the other fine-tunes them on a word prediction task from the word's definition sentence; it then investigates their properties.
ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer
TLDR
ConSERT is presented, a Contrastive Framework for Self-Supervised SEntence Representation Transfer that adopts contrastive learning to fine-tune BERT in an unsupervised and effective way and achieves new state-of-the-art performance on STS tasks.
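For orientation only, here is a minimal sketch of an NT-Xent-style contrastive objective of the kind such frameworks use to fine-tune BERT without labels. The function name, the temperature of 0.1, and the assumption that z1 and z2 hold embeddings of two augmented views of the same batch are illustrative choices, not ConSERT's exact configuration.

```python
# Sketch of an NT-Xent-style contrastive loss (illustrative, not ConSERT's
# exact recipe). z1, z2: (B, H) embeddings of two augmented views of the
# same B sentences; each example's positive is its other view.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, H), unit norm
    sim = z @ z.t() / temperature                        # pairwise cosine logits
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    b = z1.size(0)
    targets = (torch.arange(2 * b, device=z.device) + b) % (2 * b)
    return F.cross_entropy(sim, targets)
```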
Whitening Sentence Representations for Better Semantics and Faster Retrieval
TLDR
The whitening operation in traditional machine learning can similarly enhance the isotropy of sentence representations and achieve competitive results, and the whitening technique is also capable of reducing the dimensionality of the sentence representation.
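A minimal sketch of the whitening operation described above, assuming the sentence embeddings are stacked in an (N, H) NumPy array; the 1e-9 floor and the optional top-k truncation used for dimensionality reduction are illustrative choices rather than the cited paper's exact recipe.

```python
# Sketch: shift embeddings to zero mean and map them so their covariance
# becomes (approximately) the identity; keeping only the first k columns of
# the transform also reduces dimensionality. Assumes embeddings is (N, H).
import numpy as np

def whiten(embeddings, k=None):
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)                 # (H, H) covariance
    u, s, _ = np.linalg.svd(cov)                      # cov = u @ diag(s) @ u.T
    w = u @ np.diag(1.0 / np.sqrt(s + 1e-9))          # whitening matrix
    if k is not None:
        w = w[:, :k]                                  # optional dimensionality cut
    return (embeddings - mu) @ w
```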
Sentence Bottleneck Autoencoders from Transformer Language Models
TLDR
The construction of a sentence-level autoencoder from a pretrained, frozen transformer language model that achieves better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
TransAug: Translate as Augmentation for Sentence Embeddings
TLDR
This work presents TransAug (Translate as Augmentation), which provides the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduces a two-stage paradigm that advances the state-of-the-art in sentence embeddings.
A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-wise Perspective in Angular Space
TLDR
This paper proposes a new method, ArcCSE, with training objectives designed to enhance pairwise discriminative power and to model the entailment relation of triplet sentences, and demonstrates that this approach outperforms the previous state-of-the-art on diverse sentence-related tasks, including STS and SentEval.
Positional Artefacts Propagate Through Masked Language Model Embeddings
TLDR
This work finds cases of persistent outlier neurons within BERT and RoBERTa's hidden state vectors that consistently bear the smallest or largest values in said vectors, and introduces a neuron-level analysis method, which reveals that the outliers are closely related to information captured by positional embeddings.
Text and Code Embeddings by Contrastive Pre-Training
TLDR
It is shown that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code.
...

References

SHOWING 1-10 OF 36 REFERENCES
Universal Sentence Encoder
TLDR
It is found that transfer learning using sentence embeddings tends to outperform word-level transfer, achieving surprisingly good performance with minimal amounts of supervised training data for a transfer task.
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
TLDR
It is shown how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Improving Neural Language Generation with Spectrum Control
TLDR
This paper proposes a novel spectrum control approach to directly guide the spectra training of the output embedding matrix with a slow-decaying singular value prior distribution through a reparameterization framework, and demonstrates that this method outperforms the state-of-the-art Transformer-XL model for language modeling, and various Transformer-based models for machine translation, on common benchmark datasets for these tasks.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.
Representation Degeneration Problem in Training Natural Language Generation Models
TLDR
This work analyzes the conditions and causes of the representation degeneration problem and proposes a novel regularization method that can largely mitigate the problem and achieve better performance than baseline algorithms.
How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings
TLDR
It is found that in all layers of ELMo, BERT, and GPT-2, on average, less than 5% of the variance in a word’s contextualized representations can be explained by a static embedding for that word, providing some justification for the success of contextualized representations.
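As a small companion to this analysis, the sketch below computes a word's self-similarity, i.e. the average pairwise cosine similarity of its contextualized vectors across different contexts, one of the contextuality measures used in this line of work; it assumes those vectors are already stacked in an (N, H) NumPy array.

```python
# Sketch: self-similarity of one word = mean cosine similarity between its
# contextualized vectors drawn from N different contexts (assumes N >= 2,
# vectors stacked as an (N, H) array).
import numpy as np

def self_similarity(vectors):
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                              # (N, N) cosine matrix
    n = len(vectors)
    return (sims.sum() - n) / (n * (n - 1))           # mean of off-diagonal entries
```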
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
...