An Inspection of the Reproducibility and Replicability of TCT-ColBERT

  • Xiao Wang, Sean MacAvaney, Craig Macdonald, Iadh Ounis
  • Published 6 July 2022
  • Computer Science
  • Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Dense retrieval approaches are of increasing interest because they can better capture contextualised similarity compared to sparse retrieval models such as BM25. Among the most prominent of these approaches is TCT-ColBERT, which trains a lightweight "student" model from a more expensive "teacher" model. In this work, we take a closer look into TCT-ColBERT concerning its reproducibility and replicability. To structure our study, we propose a three-stage perspective on reproducing the training…


Doc2Query--: When Less is More

It is found that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 23% and cutting the index size by 33%.
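The filtering step this summary describes can be sketched in a few lines. This is a minimal illustration under our own assumptions: `scores` stands in for the output of a neural relevance model, and the keep fraction is illustrative rather than the paper's tuned setting.

```python
import numpy as np

def filter_generated_queries(queries, scores, keep_fraction=0.7):
    """Doc2Query-style filtering sketch: drop the lowest-scoring
    generated queries before the survivors are appended to the
    document representations that get indexed.

    queries: generated expansion queries
    scores:  relevance-model scores, one per (passage, query) pair
             (hypothetical inputs for illustration)
    """
    scores = np.asarray(scores, dtype=float)
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [q for q, s in zip(queries, scores) if s >= cutoff]
```

Dropping low-quality queries shrinks the index and speeds up querying precisely because fewer expansion terms are appended to each document.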

Adaptive Re-Ranking with a Corpus Graph

The Graph-based Adaptive Re-ranking (GAR) approach significantly improves the performance of re-ranking pipelines in terms of precision- and recall-oriented measures, is complementary to a variety of existing techniques, is robust to its hyperparameters, and contributes minimally to computational and storage costs.

ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

ColBERTv2 is introduced, a retriever that couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction, and it establishes state-of-the-art quality within and outside the training domain.

On Single and Multiple Representations in Dense Passage Retrieval

It is observed that, while ANCE is more efficient than ColBERT in terms of response time and memory usage, multiple representations are statistically more effective than single representations for MAP and MRR@10.

Expansion via Prediction of Importance with Contextualization

A representation-based ranking approach that explicitly models the importance of each term using a contextualized language model, and performs passage expansion by propagating the importance to similar terms, which narrows the gap between inexpensive and cost-prohibitive passage ranking approaches.

In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval

An efficient training approach to text retrieval with dense representations that applies knowledge distillation, using the ColBERT late-interaction ranking model as the teacher to transfer knowledge to a bi-encoder student by distilling ColBERT's expressive MaxSim operator into a simple dot product.
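The tightly-coupled distillation can be sketched schematically: within a training batch, the teacher's MaxSim scores over the in-batch passages define a soft target distribution that the student's dot-product scores are trained to match. The function names and the exact KL shape below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over a 1-D score vector.
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's in-batch passages.

    teacher_scores: ColBERT MaxSim scores for each in-batch passage
    student_scores: the bi-encoder student's dot-product scores
    Minimising this pushes the student's score distribution towards
    the teacher's. (Schematic sketch; names are our own.)
    """
    p = softmax(np.asarray(teacher_scores, dtype=float))
    q = softmax(np.asarray(student_scores, dtype=float))
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

Reusing the other passages in the batch as negatives is what makes the scheme cheap: no extra teacher forward passes are needed beyond the batch itself.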

Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation

This work proposes a cross-architecture training procedure with a margin focused loss (Margin-MSE), that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT ranking architectures, and shows that across evaluated architectures it significantly improves their effectiveness without compromising their efficiency.

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

ColBERT is presented, a novel ranking model that adapts deep LMs (in particular, BERT) for efficient retrieval; it is competitive with existing BERT-based models (and outperforms every non-BERT baseline) and enables leveraging vector-similarity indexes for end-to-end retrieval directly from millions of documents.
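ColBERT's late-interaction scoring can be sketched in a few lines of NumPy: for each query token embedding, take the maximum similarity over all document token embeddings, then sum over query tokens. This is a simplified illustration assuming L2-normalised embeddings, not the authors' implementation.

```python
import numpy as np

def maxsim_score(query_embs, doc_embs):
    """ColBERT-style late interaction (MaxSim).

    query_embs: (num_query_tokens, dim)
    doc_embs:   (num_doc_tokens, dim)
    Both are assumed L2-normalised, so the dot product acts as
    cosine similarity.
    """
    sim = query_embs @ doc_embs.T        # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed
```

Because each document's token embeddings can be indexed independently, the per-query-token maxima can be served from a vector-similarity index, which is what enables end-to-end retrieval rather than only reranking.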

From doc2query to docTTTTTquery

The setup in this work follows doc2query, but with T5 as the expansion model, and it is found that the top-k sampling decoder produces more effective queries than beam search.
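A single step of the top-k sampling decoder mentioned here can be sketched as follows; this is a generic illustration of the decoding rule, independent of the T5 expansion model it is paired with.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """One decoding step of top-k sampling: keep only the k tokens
    with the highest logits, renormalise their probabilities, and
    sample one token id. Assumes k < len(logits)."""
    logits = np.asarray(logits, dtype=float)
    top = np.argpartition(logits, -k)[-k:]          # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max()) # stable softmax over the top-k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

With k = 1 this reduces to greedy decoding; larger k injects the diversity that, per the summary above, yields more effective expansion queries than beam search.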

Distilling Dense Representations for Ranking using Tightly-Coupled Teachers

This work distills the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product, thus enabling single-step ANN search; the approach improves query latency and greatly reduces ColBERT's onerous storage requirements.

RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

This work proposes an optimized training approach, called RocketQA, to improve dense passage retrieval, which significantly outperforms previous state-of-the-art models on both MS MARCO and Natural Questions, and demonstrates that the performance of end-to-end QA can be improved based on the RocketQA retriever.

Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval

This work conducts the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback, and extracts representative feedback embeddings that are shown to enhance the effectiveness of both a reranking and an additional dense retrieval operation.
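Mechanically, the expansion can be sketched as follows. In this heavily simplified stand-in we pick representative feedback embeddings by their mean similarity to the query tokens; the actual approach instead clusters the token embeddings of the pseudo-relevant passages and weights the representatives by IDF, so treat the selection rule here as an assumption for illustration only.

```python
import numpy as np

def expand_query(query_embs, feedback_passage_embs, num_new=2):
    """Append `num_new` feedback embeddings, drawn from the token
    embeddings of the top-ranked (pseudo-relevant) passages, to the
    query's own token embeddings.

    query_embs:            (q_tokens, dim)
    feedback_passage_embs: list of (tokens_i, dim) arrays,
                           one per pseudo-relevant passage
    """
    pooled = np.vstack(feedback_passage_embs)
    # Simplified selection rule (our assumption): rank each candidate
    # embedding by its mean dot-product similarity to the query tokens.
    scores = (pooled @ query_embs.T).mean(axis=1)
    picked = pooled[np.argsort(scores)[-num_new:]]
    return np.vstack([query_embs, picked])
```

The expanded token set can then be reused either to rescore an existing candidate list (reranking) or to issue a second dense retrieval pass, matching the two uses evaluated in the paper.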