Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

  title={Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models},
  author={Jianmo Ni and Gustavo Hernández Ábrego and Noah Constant and Ji Ma and Keith B. Hall and Daniel Matthew Cer and Yinfei Yang},
We provide the first exploration of sentence embeddings from text-to-text transformers (T5), including the effects of scaling up sentence encoders to 11B parameters. Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods to construct Sentence-T5 (ST5) models: two that utilize only the T5 encoder and one that uses the full T5…
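One of the encoder-only strategies the abstract alludes to is pooling the encoder's per-token outputs into a single fixed-size vector. The sketch below illustrates mean pooling followed by L2 normalization on hypothetical toy vectors; the numbers and helper names are illustrative assumptions, not real T5 outputs.

```python
import math

def mean_pool(token_embeddings):
    """Average token-level vectors into one fixed-size sentence embedding.

    `token_embeddings` stands in for an encoder's per-token output vectors
    (toy data here, not real T5 activations).
    """
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Three "token" vectors of dimension 2, mimicking encoder outputs.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_embedding = l2_normalize(mean_pool(tokens))
```

In practice the pooled vector would come from a trained encoder; the pooling and normalization steps themselves are this simple.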


TransAug: Translate as Augmentation for Sentence Embeddings

TransAug (Translate as Augmentation) provides the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduces a two-stage paradigm that advances state-of-the-art sentence embeddings.

Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast

The proposed method aligns sentence representations from different languages into a unified embedding space, where semantic similarities (both cross-lingual and monolingual) can be computed with a simple dot product, achieving new state-of-the-art results on several tasks.

vec2text with Round-Trip Translations

This work proposes a simple data augmentation technique based on round-trip translations and shows in extensive experiments that the resulting vec2text model surprisingly leads to vector spaces that fulfill the authors' four desired properties, strongly outperforming both standard and denoising auto-encoders.

Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters

This work further explores the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on, and develops new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset.

DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings

This work proposes an effective knowledge distillation framework for contrastive sentence embeddings, termed DistilCSE, and introduces Contrastive Knowledge Distillation (CKD) to enhance the consistency of training objectives among teacher model training, knowledge distillation, and student model fine-tuning, which can improve performance much as prompt learning does.

Large Dual Encoders Are Generalizable Retrievers

Experimental results show that the dual encoders, Generalizable T5-based dense Retrievers (GTR), significantly outperform existing sparse and dense retrievers on the BEIR dataset (Thakur et al., 2021) and are very data efficient, needing only 10% of MS MARCO supervised data to achieve the best out-of-domain performance.

SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models

This paper introduces three types of negatives: in-batch negatives, pre-batch negatives, and self-negatives (which act as a simple form of hard negatives), and shows that the resulting model can substantially outperform embedding-based methods on several benchmark datasets.

Generative Retrieval for Long Sequences

This paper uses an encoder-decoder model to memorize the target corpus in a generative manner and then applies it to query-to-passage generation, conjecturing that generative retrieval is complementary to traditional retrieval, as an ensemble of both outperforms homogeneous ensembles.

Saving Dense Retriever from Shortcut Dependency in Conversational Search

The existence of a retrieval shortcut in conversational search (CS) is demonstrated, which causes models to retrieve passages by relying solely on partial history while disregarding the latest question, and iterative hard negatives mined by pre-trained dense retrievers are explored.

Dialog Inpainting: Turning Documents into Dialogs

Dialog inpainting takes the text of any document and transforms it into a two-person dialog between the writer and an imagined reader, using a dialog inpainter to predict what the imagined reader asked or said in between each of the writer's utterances.

Language-agnostic BERT Sentence Embedding

It is shown that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%, and a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba is released.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Making Monolingual Sentence Embeddings Multilingual Using Knowledge Distillation

An easy and efficient method is presented to extend existing sentence embedding models to new languages: the original (monolingual) model generates sentence embeddings for the source language, and a new system is then trained on translated sentences to mimic the original model.

Multilingual Universal Sentence Encoder for Semantic Retrieval

On transfer learning tasks, the multilingual embeddings approach, and in some cases exceed, the performance of English-only sentence embeddings.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Towards Universal Paraphrastic Sentence Embeddings

This work considers the problem of learning general-purpose, paraphrastic sentence embeddings based on supervision from the Paraphrase Database, and compares six compositional architectures, finding that the most complex architectures, such as long short-term memory (LSTM) recurrent neural networks, perform best on the in-domain data.

Rethinking embedding coupling in pre-trained language models

The analysis shows that larger output embeddings prevent the model's last layers from overspecializing to the pre-training task and encourage Transformer representations to be more general and more transferable to other tasks and languages.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity, is presented.
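The cosine-similarity comparison that SBERT-style embeddings enable is just a normalized dot product between two sentence vectors. A minimal sketch, using hypothetical toy embeddings rather than real SBERT outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of two similar sentences.
emb_a = [0.2, 0.8, 0.1]
emb_b = [0.25, 0.75, 0.05]
score = cosine_similarity(emb_a, emb_b)  # close to 1.0 for similar sentences
```

Because the comparison is a single vector operation, similarity between candidate sentence pairs can be scored without re-running the encoder on each pair, which is the efficiency argument SBERT makes over cross-encoder scoring.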

Universal Sentence Representation Learning with Conditional Masked Language Model

A multilingual CMLM model co-trained with bitext retrieval and natural language inference tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g. 10% improvement upon baseline models on cross-lingual semantic search.