MTEB: Massive Text Embedding Benchmark

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers

Text embeddings are commonly evaluated on a small set of datasets from a single task, which does not cover their possible applications to other tasks. It is unclear whether state-of-the-art embeddings for semantic textual similarity (STS) can be applied equally well to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB).

SGPT: GPT Sentence Embeddings for Semantic Search

SGPT uses decoder-only transformers to produce sentence embeddings for semantic search via prompting or fine-tuning, and outperforms a concurrent method with 175 billion parameters as measured on the BEIR search benchmark.

Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages

A new teacher-student training scheme is introduced which combines supervised and self-supervised training, allowing encoders to take advantage of monolingual training data, which is valuable in the low-resource setting.

Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

This work provides the first exploration of sentence embeddings from text-to-text transformers (T5) including the effects of scaling up sentence encoders to 11B parameters and establishes a new sentence representation transfer benchmark, SentGLUE, which extends the SentEval toolkit to nine tasks from the GLUE benchmark.

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

This work extensively analyzes different retrieval models, finding that performing consistently well across all datasets is challenging, and provides several suggestions it believes may be useful for future work.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE is presented, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings and regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.
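The training signal behind SimCSE is a contrastive (InfoNCE) loss over cosine similarities: each sentence embedding should be close to its positive pair and far from the other pairs in the batch. A minimal sketch with toy vectors follows; the actual paper trains BERT-style encoders with dropout-based augmentation and in-batch negatives, none of which is reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(anchors, positives, temperature=0.05):
    """Contrastive loss: each anchor should match its own positive,
    with the other positives in the batch serving as negatives."""
    losses = []
    for i, a in enumerate(anchors):
        sims = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        losses.append(log_denom - sims[i])  # -log softmax of the true pair
    return sum(losses) / len(losses)

# Toy batch: two "sentence" embeddings and their augmented views.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]  # aligned pairs -> low loss
loss = info_nce_loss(anchors, positives)
```

Swapping the positives so each anchor points at the wrong pair drives the loss up sharply, which is exactly the gradient signal that pulls matched pairs together.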

Language-agnostic BERT Sentence Embedding

It is shown that introducing a pre-trained multilingual language model reduces the amount of parallel training data required to achieve good performance by 80%, and a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba is released.

SPECTER: Document-level Representation Learning using Citation-informed Transformers

This work proposes SPECTER, a new method to generate document-level embeddings of scientific papers by pretraining a Transformer language model on a powerful signal of document-level relatedness, the citation graph, and shows that SPECTER outperforms a variety of competitive baselines on the benchmark.

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

This work presents a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation, and demonstrates that the monolingual model outperforms state-of-the-art baselines across different student model sizes.
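One ingredient of self-attention distillation is matching the student's attention distributions to the teacher's via KL divergence (MiniLM also distills value relations and restricts this to the last layer; the sketch below covers only the attention-distribution term, with hand-picked toy distributions).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def attention_distillation_loss(teacher_attn, student_attn):
    """Average KL between teacher and student attention distributions,
    one distribution per query position (each row sums to 1)."""
    losses = [kl_divergence(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(losses) / len(losses)

# Toy attention maps over 3 positions for 2 query positions.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
loss = attention_distillation_loss(teacher, student)
```

The loss is zero exactly when the student reproduces the teacher's attention rows, and grows as the distributions diverge, which is what drives the student toward the teacher's behavior.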

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Attention is All you Need

A new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
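The Transformer's core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A toy single-head version follows, without masking, batching, or the learned projections a real implementation would have.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one distribution over the key positions
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention(Q, K, V)
```

Because the attention weights form a probability distribution, the output is a convex combination of the value vectors, tilted toward the key most similar to the query.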