A Study of Neural Matching Models for Cross-lingual IR

  title={A Study of Neural Matching Models for Cross-lingual IR},
  author={Puxuan Yu and James Allan},
  journal={Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
  • Published 26 May 2020
  • Computer Science
In this study, we investigate interaction-based neural matching models for ad-hoc cross-lingual information retrieval (CLIR) using cross-lingual word embeddings (CLWEs). With experiments conducted on the CLEF collection over four language pairs, we evaluate and provide insight into different neural model architectures, different ways to represent query-document interactions and word-pair similarity distributions in CLIR. This study paves the way for learning an end-to-end CLIR system using… 
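
The interaction-based setup described in the abstract can be sketched in a few lines: given query and document term embeddings already projected into a shared cross-lingual space, the model's input is a matrix of word-pair cosine similarities between the (source-language) query terms and the (target-language) document terms. This is an illustrative sketch, not code from the paper; the function name and toy vectors are ours:

```python
import numpy as np

def interaction_matrix(query_vecs, doc_vecs):
    """Cosine-similarity interaction matrix between query terms (one
    language) and document terms (another), assuming both are embedded
    in a shared cross-lingual word embedding (CLWE) space."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T  # shape: (num_query_terms, num_doc_terms)

# Toy example with random stand-ins for CLWEs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 50))   # 3 query terms, 50-dim embeddings
D = rng.normal(size=(10, 50))  # 10 document terms
M = interaction_matrix(Q, D)
assert M.shape == (3, 10)
```

Interaction-based matchers (e.g. MatchPyramid, K-NRM, discussed below) consume exactly this kind of matrix rather than separate query and document encodings.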


Cross-lingual Language Model Pretraining for Retrieval

This paper introduces two novel retrieval-oriented pretraining tasks to further pretrain cross-lingual language models for downstream retrieval tasks such as cross-lingual ad-hoc retrieval (CLIR), and proposes to directly finetune language models on part of the evaluation collection by making Transformers capable of accepting longer sequences.

Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval

A novel Mixed Attention Transformer (MAT) is proposed that incorporates external word-level knowledge, such as a dictionary or translation table into an attention matrix, and is able to focus on the mutually translated words in the input sequence.

On cross-lingual retrieval with multilingual text encoders

The results indicate that for unsupervised document-level CLIR, pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs, and point to “monolingual overfitting” of retrieval models trained on monolingual (English) data, even when they are based on multilingual transformers.

C3: Continued Pretraining with Contrastive Weak Supervision for Cross Language Ad-Hoc Retrieval

This work uses comparable Wikipedia articles in different languages to further pretrain off-the-shelf multilingual pretrained models before fine-tuning on the retrieval task, and shows that this approach yields improvements in retrieval effectiveness.

Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

This work presents a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs, and indicates that for unsupervised document-level CLIR – a setup in which there are no relevance judgments for task-specific fine-tuning – the pretrained encoders fail to significantly outperform models based on CLWEs.

Cross-Lingual Training with Dense Retrieval for Document Retrieval

These experiments reveal that zero-shot model-based transfer using mBERT improves search quality in non-English monolingual retrieval, and that weakly supervised target-language transfer yields competitive performance against generation-based target-language transfer, which requires external translators and query generators.

Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation

To transfer a model from high- to low-resource languages, OPTICAL formulates cross-lingual token alignment as an optimal transport problem so as to learn from a well-trained monolingual retrieval model; it significantly outperforms strong baselines on low-resource languages, including neural machine translation.
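
The optimal-transport view of token alignment can be illustrated with a plain Sinkhorn solver: entropy-regularized OT produces a soft alignment (transport plan) between source and target tokens under a cost matrix. This is a generic sketch of the Sinkhorn iterations, not OPTICAL's actual training objective; all names and parameters here are illustrative:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations:
    returns a transport plan whose row sums match a (source token mass)
    and column sums match b (target token mass)."""
    K = np.exp(-cost / eps)       # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)         # rescale to match column marginals
        u = a / (K @ v)           # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Toy alignment: 3 source tokens vs. 4 target tokens, uniform mass.
rng = np.random.default_rng(1)
cost = rng.random((3, 4))
a = np.full(3, 1 / 3)
b = np.full(4, 1 / 4)
plan = sinkhorn(cost, a, b)
assert plan.shape == (3, 4)
```

Low-cost (i.e. likely-translation) token pairs receive more transport mass, which is the soft alignment such methods distill from.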

Deep Multilabel Multilingual Document Learning for Cross-Lingual Document Retrieval

The proposed method, MDL (deep multilabel multilingual document learning), leverages a six-layer fully connected network to project cross-lingual documents into a shared semantic space, and is more efficient than models that train all languages jointly, since each language is trained individually.

Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

Results on ad hoc document ranking tasks in several languages demonstrate substantial and statistically significant improvements of these trained dense retrieval models over traditional lexical CLIR baselines.

Cross-language Information Retrieval



Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only

This work proposes a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all; the authors see it as a first step towards effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.

Flat vs. hierarchical phrase-based translation models for cross-language information retrieval

This paper compares flat and hierarchical phrase-based translation models for query translation and finds that both approaches yield significantly better results than either a token-based or a one-best translation baseline on standard test collections.

Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion

This paper proposes a unified formulation that directly optimizes a retrieval criterion in an end-to-end fashion for word translation, and shows that this approach outperforms the state of the art on word translation.

Looking inside the box: context-sensitive translation for cross-language information retrieval

This work presents a novel CLIR framework that is able to reach inside the translation "black box" and exploit internal sources of evidence, including token-to-token mappings from bilingual dictionaries.

How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions

It is empirically demonstrated that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance; the study also identifies the most robust supervised and unsupervised CLE models.

A Study of MatchPyramid Models on Ad-hoc Retrieval

The MatchPyramid models can significantly outperform several recently introduced deep matching models on the retrieval task, but still cannot compete with traditional retrieval models such as BM25 and language models.

Word Translation Without Parallel Data

It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.

End-to-End Neural Ad-hoc Ranking with Kernel Pooling

K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features, and a learning-to-rank layer that combines those features into the final ranking score.
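
Kernel pooling itself is simple to sketch: each RBF kernel softly counts how many word pairs in the translation (similarity) matrix fall into a similarity band, and log-pooling over query terms yields one soft-match feature per kernel. A minimal numpy sketch follows; the kernel means and width are illustrative, not K-NRM's exact hyperparameters:

```python
import numpy as np

def kernel_pooling(sim_matrix, mus, sigma=0.1):
    """Soft-match features in the style of K-NRM: for each query-term row,
    an RBF kernel centered at mu softly counts document terms whose
    similarity falls near mu; log-summing over query terms gives one
    feature per kernel."""
    feats = []
    for mu in mus:
        k = np.exp(-((sim_matrix - mu) ** 2) / (2 * sigma ** 2))
        per_query = k.sum(axis=1)                          # pool over document terms
        feats.append(np.log(per_query.clip(1e-10)).sum())  # pool over query terms
    return np.array(feats)

sim = np.array([[0.95, 0.2], [0.1, 0.5]])           # toy similarity matrix
mus = np.linspace(-0.9, 0.9, 10).tolist() + [1.0]   # the mu=1.0 kernel ~ exact match
features = kernel_pooling(sim, mus)
assert features.shape == (11,)
```

In K-NRM these per-kernel features feed a learning-to-rank layer that produces the final score, so the kernels act as differentiable soft histogram bins over word-pair similarities.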

Exploiting Representations from Statistical Machine Translation for Cross-Language Information Retrieval

This work explores how internal representations of modern statistical machine translation systems can be exploited for cross-language information retrieval and proposes two novel query translation approaches: the grammar-based approach extracts translation probabilities from translation grammars, while the decoder-based approach takes advantage of n-best translation hypotheses.

Learning to Match using Local and Distributed Representations of Text for Web Search

This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation and another that matches them using learned distributed representations; matching with distributed representations is shown to complement matching with traditional local representations.