A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models

Authors: Iurii Mokrii, Leonid Boytsov, Pavel Braslavski
Published in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Due to high annotation costs, making the best use of existing human-created training data is an important research direction. We therefore carry out a systematic evaluation of the transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot…

Noise-Reduction for Automatically Transferred Relevance Judgments

This work compares the predicted relevance probabilities of monoT5 for the two versions of the judged documents and finds substantial differences, and shows that training a retrieval model on the "wrong" version can reduce the nDCG@10 by up to 75%.

How Train-Test Leakage Affects Zero-shot Retrieval

This paper investigates the impact of unintended train–test leakage by training neural models on MS MARCO document ranking data with different proportions of controlled leakage to Robust04 and the TREC 2017 and 2018 Common Core tracks as test datasets.

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

A thorough structured overview of mainstream techniques for low-resource DR, dividing the techniques into three main categories based on their required resources, and highlighting the open issues and pros and cons.

Professionalism and clinical short answer question marking with machine learning

Machine learning may assist in medical student evaluation. This study involved scoring short answer questions administered at three centres. Bidirectional encoder representations from transformers

InPars: Unsupervised Dataset Generation for Information Retrieval

This work harnesses the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks and shows that models finetuned solely on these synthetic datasets outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods.

Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding

A comprehensive evaluation of 13 recent models for ranking of long documents using two popular collections (MS MARCO documents and Robust04) finds the simple FirstP baseline (truncating documents to satisfy the input-sequence constraint of a typical Transformer model) to be quite effective.

From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective

This work builds on SPLADE, a sparse expansion-based retriever, and shows to what extent it can benefit from the same training improvements as dense models by studying the effect of distillation, hard-negative mining, and pre-trained language model initialization.

An Empirical Study on Transfer Learning for Privilege Review

This paper examines the effectiveness of transfer learning for privilege models on three real-world datasets with privilege labels and shows that a BERT model outperforms the industry-standard logistic regression algorithm and that transfer learning models can achieve decent performance on datasets in the same or close domains.

Sequential Attention Module for Natural Language Processing

This paper proposes a simple yet effective plug-and-play module, Sequential Attention Module (SAM), on the token embeddings learned from a pre-trained language model, and demonstrates that SAM consistently outperforms the state-of-the-art baselines.



A Little Bit Is Worse Than None: Ranking with Limited Training Data

This work explores zero-shot ranking using BERT models that have already been fine-tuned with the large MS MARCO passage retrieval dataset and arrives at the surprising and novel finding that “some” labeled in-domain data can be worse than none at all.

Natural Questions: A Benchmark for Question Answering Research

The Natural Questions corpus, a question answering data set, is presented, introducing robust metrics for the purposes of evaluating question answering systems; demonstrating high human upper bounds on these metrics; and establishing baseline results using competitive methods drawn from related literature.

Neural Ranking Models with Weak Supervision

This paper proposes to train a neural ranking model using weak supervision, where labels are obtained automatically without human annotators or any external resources, and suggests that supervised neural ranking models can greatly benefit from pre-training on large amounts of weakly labeled data that can be easily obtained from unsupervised IR models.

Understanding inverse document frequency: on theoretical arguments for IDF

It is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

This work extensively analyzes different retrieval models and provides several suggestions that it believes may be useful for future work, finding that performing well consistently across all datasets is challenging.

Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits

It is shown that adding an interpretable neural Model 1 layer on top of BERT-based contextualized embeddings does not decrease accuracy and/or efficiency and may overcome the limitation on the maximum sequence length of existing BERT models.

Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study

It is found that the transfer of the BERT-PLI model on the paragraph-level leads to comparable results between both domains as well as first promising results for the cross-domain transfer on the document-level.

Flexible retrieval with NMSLIB and FlexNeuART

NMSLIB is introduced to the NLP community, and a new retrieval toolkit, FlexNeuART, is described, which can efficiently retrieve mixed dense and sparse representations (with weights learned from training data); this is achieved by extending NMSLIB.

Pretrained Transformers for Text Ranking: BERT and Beyond

This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.