Cross-Lingual Document Retrieval with Smooth Learning

Jiapeng Liu, Xiao Zhang, Dan Goldwasser, Xiao Wang
Cross-lingual document search is an information retrieval task in which the query's language differs from the documents' language. In this paper, we study the instability of neural document search models and propose a novel end-to-end robust framework that achieves improved cross-lingual search performance across different document languages. The framework includes smooth cosine similarity, a novel measure of relevance between queries and documents, and a novel loss function…
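The abstract does not give the formula for the proposed smooth variant, but the standard cosine similarity it builds on is well defined. A minimal sketch of cosine relevance between a query embedding and a set of document embeddings (all names and toy vectors are illustrative, not from the paper):

```python
import numpy as np

def cosine_relevance(query_vec, doc_vecs):
    """Standard cosine similarity between one query embedding and each
    row of a document-embedding matrix (the non-smooth baseline)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q

query = np.array([1.0, 0.0, 1.0])
docs = np.array([[1.0, 0.0, 1.0],   # same direction as the query -> 1.0
                 [0.0, 1.0, 0.0]])  # orthogonal to the query -> 0.0
print(cosine_relevance(query, docs))  # [1. 0.]
```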

Aspect term extraction and optimized deep fuzzy clustering-based inverted indexing for document retrieval

This paper develops a novel approach, Exponential Aquila Optimizer (EAO)-based Deep Fuzzy Clustering, for document retrieval; it effectively finds relevant documents and models the relationship between documents and queries in terms of each document's significance for query optimization.

The Geometry of Multilingual Language Models: An Equality Lens

This study analyzes the geometry of three multilingual language models in Euclidean space and finds that all languages are represented by unique geometries, and introduces a Cross-Lingual Similarity Index to measure the distance of languages with each other in the semantic space.

Topological Data Analysis of Database Representations for Information Retrieval

This work computes persistent homology on a variety of datasets, shows that some commonly used embeddings fail to preserve the connectivity of the underlying database, and introduces the dilation-invariant bottleneck distance to capture this effect.

Topological Information Retrieval with Dilation-Invariant Bottleneck Comparative Measures

This work shows that embeddings which successfully retain the database topology coincide in persistent homology, introduces two dilation-invariant comparative measures to capture this effect, and provides an algorithm for their computation with greatly reduced time complexity over existing methods.

Cross-Lingual Learning-to-Rank with Shared Representations

A large-scale dataset derived from Wikipedia is introduced to support CLIR research in 25 languages and a simple yet effective neural learning-to-rank model is presented that shares representations across languages and reduces the data requirement.

Translation techniques in cross-language information retrieval

This survey reviews the wide range of techniques and models supporting free-text translation that the CLIR community has developed over the last 15 years, with special emphasis on recent developments.

Cross language information retrieval

This work focuses on the development of a model for automatic Cross-Language Information Retrieval using Latent Semantic Indexing and its application to Machine Translation Technology.

Learning deep structured semantic models for web search using clickthrough data

A series of new latent semantic models with a deep structure is developed; these models project queries and documents into a common low-dimensional space in which the relevance of a document to a query is readily computed as the distance between them.

Indexing by Latent Semantic Analysis

A new method for automatic indexing and retrieval is described that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
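The core of Latent Semantic Indexing is a truncated SVD of the term-document matrix, with queries folded into the same latent space. A small sketch under toy data (the matrix values and dimensions are illustrative only):

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
X = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 2.],
              [1., 0., 1.]])

# Truncated SVD: keep k latent "semantic" dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_embeddings = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

# A query is folded in via q^T U_k S_k^{-1}, then compared to documents
# in the latent space (e.g. with cosine similarity).
q = np.array([1., 0., 0., 1.])                 # query containing terms 0 and 3
q_embedding = q @ U[:, :k] @ np.diag(1.0 / s[:k])
print(doc_embeddings.shape, q_embedding.shape)  # (3, 2) (2,)
```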

A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval

A new latent semantic model that incorporates a convolutional-pooling structure over word sequences to learn low-dimensional, semantic vector representations for search queries and Web documents is proposed.

GloVe: Global Vectors for Word Representation

A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
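The GloVe objective is a weighted least-squares fit of word-vector dot products to log co-occurrence counts. A naive, unoptimized sketch of that loss (variable names and the toy co-occurrence matrix are illustrative):

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """GloVe objective: sum_ij f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    with weighting f(x) = min((x / x_max)^alpha, 1), over nonzero counts."""
    loss = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            if X[i, j] > 0:
                f = min((X[i, j] / x_max) ** alpha, 1.0)
                err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
                loss += f * err ** 2
    return loss

V, d = 4, 3                          # toy vocabulary size and embedding dim
rng = np.random.default_rng(1)
W, Wt = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, bt = np.zeros(V), np.zeros(V)
X = np.array([[0., 2., 1., 0.],      # symmetric toy co-occurrence counts
              [2., 0., 3., 1.],
              [1., 3., 0., 2.],
              [0., 1., 2., 0.]])
print(glove_loss(W, Wt, b, bt, X))
```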

Query-level stability and generalization in learning to rank

A theory of the generalization ability of learning-to-rank algorithms for information retrieval (IR) is proposed and applied to the existing Ranking SVM and IRSVM algorithms, and a number of new concepts are defined, including query-level loss, query-level risk, and query-level stability.

Distributed Representations of Words and Phrases and their Compositionality

This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
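The negative-sampling objective replaces the full softmax with a binary discrimination between the observed context word and k sampled noise words. A minimal sketch for a single (center, context) pair, with hypothetical toy embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one training pair:
    -log sigma(v_c . v_o) - sum_k log sigma(-v_c . v_neg_k),
    where `negatives` holds k sampled noise-word embeddings."""
    pos = np.log(sigmoid(center @ context))
    neg = np.sum(np.log(sigmoid(-negatives @ center)))
    return -(pos + neg)

rng = np.random.default_rng(0)
c, o = rng.normal(size=50), rng.normal(size=50)   # center/context vectors
negs = rng.normal(size=(5, 50))                   # k = 5 noise vectors
print(negative_sampling_loss(c, o, negs))         # a positive scalar
```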

Generalization error bounds for learning to rank: Does the length of document lists matter?

It is shown that there is no degradation in generalization ability with document-list length for several loss functions, including the cross-entropy loss used in the well-known ListNet method, and novel generalization error bounds under l1 regularization are provided, with faster convergence rates when the loss function is smooth.