Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations

  Rui Zhang, Caitlin Westerfield, Sungrok Shim, G. Bingham, Alexander R. Fabbri, Neha Verma, William Hu, Dragomir R. Radev
In this paper, we propose to boost low-resource cross-lingual document retrieval performance with deep bilingual query-document representations. We match queries and documents in both source and target languages with four components, each of which is implemented as a term interaction-based deep neural network with cross-lingual word embeddings as input. By including query likelihood scores as extra features, our model effectively learns to rerank the retrieved documents by using a small number…
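The term-interaction reranking idea above can be sketched minimally: score each query term against its best-matching document term under cross-lingual embeddings, then combine that neural matching score with the query-likelihood feature. The weights, the simple linear combination, and the toy 2-d vectors below are illustrative assumptions, not the paper's trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def interaction_score(query_vecs, doc_vecs):
    """Term interaction: for each query term, take the best-matching
    document term's similarity, then average over query terms."""
    if not query_vecs or not doc_vecs:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_vecs)
               for q in query_vecs) / len(query_vecs)

def rerank(query_vecs, candidates, w_neural=0.7, w_ql=0.3):
    """Rerank candidates, each a (doc_vecs, query_likelihood) pair, by a
    weighted combination of the matching score and the QL feature."""
    scored = [(w_neural * interaction_score(query_vecs, doc_vecs) + w_ql * ql, i)
              for i, (doc_vecs, ql) in enumerate(candidates)]
    return [i for _, i in sorted(scored, reverse=True)]
```

In the full model the combination weights are learned from relevance labels; fixing them here just keeps the sketch self-contained.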
Cross-Lingual Training with Dense Retrieval for Document Retrieval
  • Peng Shi, Rui Zhang, He Bai, Jimmy Lin
  • 2021
Dense retrieval has shown great success in passage ranking in English. However, its effectiveness in document retrieval for non-English languages remains unexplored due to limitations in training…
Cross-language Sentence Selection via Data Augmentation and Rationale Training
This paper uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model that performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data.
Combining Contextualized and Non-contextualized Query Translations to Improve CLIR
Evidence is presented that combining such context-dependent translation probabilities with context-independent translation probabilities learned from the same parallel corpus can yield improvements in the effectiveness of cross-language ranked retrieval.
CLIRMatrix: A Massively Large Collection of Bilingual and Multilingual Datasets for Cross-Lingual Information Retrieval
This work presents CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia, intended to support research in end-to-end neural information retrieval.
Cross-Lingual Transfer Learning for Complex Word Identification
This work aims to provide evidence that the proposed models can learn the characteristics of complex words in a multilingual environment, relying on the CWI 2018 shared task dataset available for four different languages (English, German, Spanish, and French).
Cross-Lingual Low-Resource Set-to-Description Retrieval for Global E-Commerce
A new task of cross-lingual set-to-description retrieval in cross-border e-commerce, which involves matching product attribute sets in the source language with persuasive product descriptions in the target language, is explored, and a novel cross-lingual matching network (CLMN) is proposed with the enhancement of context-dependent cross-lingual mapping upon the pre-trained monolingual BERT representations.
A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios
A structured overview is given of methods that enable learning when training data is sparse, including mechanisms to create additional labeled data such as data augmentation and distant supervision, as well as transfer learning settings that reduce the need for target supervision.
XOR QA: Cross-lingual Open-Retrieval Question Answering
This work constructs a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for, and introduces a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources.


Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
It is shown in an experimental evaluation on patent prior art search that the approach, and in particular a consensus-based combination of boosting and translation-based approaches, yields substantial improvements in CLIR performance.
Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only
This work proposes a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all; the authors argue that the proposed framework is a first step towards the development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.
Cross-Lingual Learning-to-Rank with Shared Representations
A large-scale dataset derived from Wikipedia is introduced to support CLIR research in 25 languages, and a simple yet effective neural learning-to-rank model is presented that shares representations across languages and reduces the data requirement.
Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings
A novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) is presented, which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data.
Learning Translational and Knowledge-based Similarities from Relevance Rankings for Cross-Language Retrieval
An approach to cross-language retrieval is presented that combines dense knowledge-based features and sparse word translations, learned directly from relevance rankings of bilingual documents in a pairwise ranking framework.
Learning to rank with (a lot of) word features
This article defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score.
Querying across languages: a dictionary-based approach to multilingual information retrieval
Using translated queries and a bilingual transfer dictionary, it is learned that cross-language multilingual IR is feasible, although performance lags considerably behind the monolingual standard.
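The dictionary-based approach reduces to substituting each source-language query term with its dictionary translations before running monolingual retrieval. A minimal sketch with a toy English-German dictionary and plain term-overlap scoring (real systems weight the candidate translations rather than treating them equally):

```python
def translate_query(query_terms, bilingual_dict):
    """Replace each source-language term with all of its dictionary
    translations; terms without an entry are kept as-is, a common
    fallback for names and cognates."""
    translated = []
    for term in query_terms:
        translated.extend(bilingual_dict.get(term, [term]))
    return translated

def overlap_score(query_terms, doc_terms):
    """Score a target-language document by how many of the translated
    query terms it contains."""
    doc = set(doc_terms)
    return sum(1 for t in query_terms if t in doc)

# Toy example: translate an English query, score a German document.
toy_dict = {"dog": ["hund"], "house": ["haus", "heim"]}
translated = translate_query(["dog", "house"], toy_dict)
```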
Word Translation Without Parallel Data
It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.
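Methods in this line typically refine the alignment by solving an orthogonal Procrustes problem over a seed dictionary of word pairs. The adversarial initialization is beyond a toy sketch, but the Procrustes step has a closed form in the 2-D special case shown below (toy vectors; real embedding spaces are high-dimensional and use an SVD instead):

```python
import math

def fit_rotation(pairs):
    """Closed-form 2-D orthogonal Procrustes: the rotation angle that
    best maps each source vector onto its target counterpart in the
    least-squares sense."""
    s = sum(x1 * y2 - x2 * y1 for (x1, x2), (y1, y2) in pairs)
    c = sum(x1 * y1 + x2 * y2 for (x1, x2), (y1, y2) in pairs)
    return math.atan2(s, c)

def rotate(v, theta):
    """Apply a 2-D rotation by angle theta to vector v."""
    x, y = v
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))
```

Given seed pairs generated by some unknown rotation, `fit_rotation` recovers that rotation exactly; with noisy real dictionaries it returns the least-squares best fit.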
A Dual Embedding Space Model for Document Ranking
The proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches, and shows that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF.
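The DESM score is simple to state: represent the document by the normalized centroid of its (normalized) word vectors, then average the cosine between each query term vector and that centroid. A pure-Python sketch with toy vectors (the original model uses separate IN/OUT embedding matrices, which this sketch glosses over):

```python
import math

def _normalize(v):
    """Return v scaled to unit length (unchanged if it is the zero vector)."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n else v

def desm_score(query_vecs, doc_vecs):
    """Average cosine between each query term vector and the normalized
    centroid of the document's normalized term vectors."""
    dim = len(doc_vecs[0])
    centroid = [0.0] * dim
    for v in doc_vecs:
        nv = _normalize(v)
        centroid = [c + a / len(doc_vecs) for c, a in zip(centroid, nv)]
    centroid = _normalize(centroid)
    total = 0.0
    for q in query_vecs:
        nq = _normalize(q)
        total += sum(a * b for a, b in zip(nq, centroid))
    return total / len(query_vecs)
```

Because the centroid aggregates the whole document, a document whose terms cluster near the query term ("aboutness") scores higher than one that merely mentions unrelated terms.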
Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks
This paper presents a convolutional neural network architecture for reranking pairs of short texts, where the optimal representation of text pairs and a similarity function to relate them in a supervised way are learned from the available training data.