Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

Suraj Nair, Eugene Yang, Dawn J Lawrie, Kevin Duh, Paul McNamee, Kenton Murray, James Mayfield, Douglas W. Oard
The advent of transformer-based models such as BERT has led to the rise of neural ranking models. These models have improved the effectiveness of retrieval systems well beyond that of lexical term matching models such as BM25. While monolingual retrieval tasks have benefited from large-scale training collections such as MS MARCO and advances in neural architectures, cross-language retrieval tasks have fallen behind these advancements. This paper introduces ColBERT-X, a generalization of the… 

Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation

To transfer a model from high- to low-resource languages, OPTICAL formulates cross-lingual token alignment as an optimal transport problem so that it can learn from a well-trained monolingual retrieval model. It significantly outperforms strong baselines on low-resource languages, including neural machine translation.

Multilingual ColBERT-X

A multilingual training procedure can produce a version of ColBERT-X, a dense retrieval model for Cross-Language Information Retrieval trained on a pretrained multilingual neural language model, that is well suited to MLIR.

Learning to Enrich Query Representation with Pseudo-Relevance Feedback for Cross-lingual Retrieval

NCLPRF is a novel neural CLIR architecture capable of incorporating pseudo-relevance feedback (PRF) from multiple, potentially long documents, which enables improvements to query representation in the shared semantic space between query and document languages.

Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval

Two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer to multilingual and cross-lingual retrieval tasks.

C3: Continued Pretraining with Contrastive Weak Supervision for Cross Language Ad-Hoc Retrieval

This work uses comparable Wikipedia articles in different languages to further pretrain off-the-shelf multilingual pretrained models before fine-tuning on the retrieval task, and shows that this approach yields improvements in retrieval effectiveness.

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

The dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language, and the goal is to spur research that will improve retrieval across a continuum of languages.

MuSeCLIR: A Multiple Senses and Cross-lingual Information Retrieval Dataset

This paper addresses a deficiency in existing cross-lingual information retrieval (CLIR) datasets, provides a robust evaluation of CLIR systems' disambiguation ability, and introduces a new evaluation dataset (MuSeCLIR), which focuses on polysemous common nouns with multiple possible translations.

Cross-language Information Retrieval

HC4: A New Suite of Test Collections for Ad Hoc CLIR

HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document



Training Effective Neural CLIR by Bridging the Translation Gap

We introduce Smart Shuffling, a cross-lingual embedding (CLE) method that draws from statistical word alignment approaches to leverage dictionaries, producing dense representations that are

mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset

The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource

Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations

The model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks and can also be directly applied to another language pair without any training label.

Cross-Lingual Training with Dense Retrieval for Document Retrieval

These experiments reveal that zero-shot model-based transfer using mBERT improves the search quality in non-English mono-lingual retrieval, and that weakly-supervised target-language transfer yields competitive performance against the generation-based target-language transfer that requires external translators and query generators.

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.

Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-Shot Learning

The proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish and it is shown that augmenting the English training collection with some examples from the target language can sometimes improve performance.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Cross-Lingual Relevance Transfer for Document Retrieval

Experiments on test collections in five different languages from diverse language families show that models trained with English data improve ranking quality, without any special processing, both for (non-English) mono-lingual retrieval and for cross-lingual retrieval.

Weakly Supervised Attentional Model for Low Resource Ad-hoc Cross-lingual Information Retrieval

This model relies on an attention mechanism to learn spans in the foreign sentence that are relevant to the query. It achieves a 19-point MAP improvement over using CNNs for feature extraction, a 12-point improvement over machine translation-based CLIR, and up to a 6-point improvement over probabilistic CLIR models.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed. It generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.