Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings

@article{Vulic2015MonolingualAC,
  title={Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings},
  author={Ivan Vulic and Marie-Francine Moens},
  journal={Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2015}
}
  • Published 9 August 2015
  • Computer Science
We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors known as word embeddings (WE) from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable… 
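
As a rough, hedged illustration of the framework described in the abstract (not the authors' released code), the sketch below merges each document-aligned pair into a single pseudo-bilingual document, trains a skip-gram model on the merged corpus (gensim's Word2Vec is used here as a stand-in for BWESG), represents queries and documents as sums of their word vectors, and ranks by cosine similarity. The random shuffle, the window size, and plain additive composition are illustrative assumptions.

import random
import numpy as np
from gensim.models import Word2Vec  # assumes gensim >= 4.0

def merge_and_shuffle(doc_src, doc_tgt, seed=0):
    # Merge one document-aligned pair into a single pseudo-bilingual document
    # so that words of both languages share contexts during training.
    merged = list(doc_src) + list(doc_tgt)
    random.Random(seed).shuffle(merged)
    return merged

def train_bilingual_embeddings(aligned_pairs, dim=100):
    # aligned_pairs: iterable of (source_tokens, target_tokens) pairs.
    corpus = [merge_and_shuffle(s, t, seed=i)
              for i, (s, t) in enumerate(aligned_pairs)]
    # sg=1 selects skip-gram; a large window compensates for the shuffling.
    return Word2Vec(corpus, vector_size=dim, sg=1, window=16, min_count=1)

def text_vector(tokens, model):
    # Additive composition: sum the embeddings of in-vocabulary tokens.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

def rank_documents(query_tokens, documents, model):
    # Rank documents (possibly in the other language) by cosine similarity.
    q = text_vector(query_tokens, model)
    def cosine(doc):
        v = text_vector(doc, model)
        denom = (np.linalg.norm(q) * np.linalg.norm(v)) or 1.0
        return float(q @ v) / denom
    return sorted(range(len(documents)), key=lambda i: -cosine(documents[i]))

Because tokens of both languages share one vector space after training, the same ranking function serves monolingual and cross-lingual retrieval alike, which is the unification the abstract refers to.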

Citations

Bilingual Distributed Word Representations from Document-Aligned Comparable Data
TLDR
It is revealed that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information.
Unsupervised Cross-Lingual Information Retrieval Using Monolingual Data Only
TLDR
This work proposes a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) that requires no bilingual data at all, and argues that it is a first step towards the development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.
Using Communities of Words Derived from Multilingual Word Vectors for Cross-Language Information Retrieval in Indian Languages
TLDR
It is shown that choosing target-language words as query translations from the clusters or communities containing the query terms improves CLIR, and that better-quality query translations are obtained when words from more languages are used for the clustering.
Using Word Embeddings for Query Translation for Hindi to English Cross Language Information Retrieval
TLDR
An approach based on word embeddings is presented: it captures contextual clues for a particular word in the source language and returns as translations those target-language words that occur in similar contexts (a small illustrative sketch of this nearest-neighbour translation idea follows the citation list below).
On the Role of Seed Lexicons in Learning Bilingual Word Embeddings
TLDR
Effectively, it is demonstrated that an SBWES may be induced by leveraging only a very weak bilingual signal (document alignments) along with monolingual data.
Cross-Lingual Syntactically Informed Distributed Word Representations
TLDR
Experiments with several language pairs on word similarity and bilingual lexicon induction, two fundamental semantic tasks emphasising semantic similarity, suggest the usefulness of the proposed syntactically informed cross-lingual word vector spaces.
Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations
TLDR
The model outperforms competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks, and can also be applied directly to another language pair without any training labels.
Deep Multilabel Multilingual Document Learning for Cross-Lingual Document Retrieval
TLDR
The proposed method, MDL (deep multilabel multilingual document learning), leverages a six-layer fully connected network to project cross-lingual documents into a shared semantic space, and is more efficient than models that train all languages jointly, since each language is trained individually.
Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon Induction
TLDR
It is shown that the BWE-based BLI models significantly outperform the MuPTM-based and context-counting models in this setting, and obtain the best reported BLI results for all three tested language pairs.
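
As referenced above for the Hindi-to-English query-translation entry, the following is a minimal, hedged sketch of translating a query term by taking its nearest target-language neighbours in a shared cross-lingual embedding space. The dictionary-of-vectors inputs, the top-k parameter, and cosine scoring are illustrative assumptions, not details taken from the cited paper.

import numpy as np

def translate_term(term, src_vecs, tgt_vecs, k=3):
    # src_vecs / tgt_vecs: dicts mapping words to numpy vectors that are
    # assumed to already live in the same cross-lingual space.
    if term not in src_vecs:
        return []                              # out-of-vocabulary: leave untranslated
    q = src_vecs[term]
    q = q / (np.linalg.norm(q) or 1.0)
    scored = []
    for word, vec in tgt_vecs.items():
        vec = vec / (np.linalg.norm(vec) or 1.0)
        scored.append((float(q @ vec), word))
    return [w for _, w in sorted(scored, reverse=True)[:k]]

The top-k neighbours can then be used as query translations in an otherwise standard monolingual retrieval pipeline, which is the general pattern the cluster-based and embedding-based entries above describe.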

References

SHOWING 1-10 OF 50 REFERENCES
Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora
TLDR
The main importance of this work lies in the fact that it provides novel statistical CLIR models that exhaustively exploit as many cross-lingual clues as possible in the quest for better CLIR results, without the use of any additional external resources such as parallel corpora or machine-readable dictionaries.
Extracting Multilingual Topics from Unaligned Comparable Corpora
TLDR
This paper presents a generative model called JointLDA which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus, and finds that the monolingual models learnt while optimizing the cross-lingual corpora are more effective than the corresponding LDA models.
Multilingual Models for Compositional Distributed Semantics
TLDR
A novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings and demonstrates that these representations are semantically plausible and can capture semantic relationships across languages without parallel data.
Cross-lingual relevance models
TLDR
A formal model of Cross-Language Information Retrieval that does not rely on either query translation or document translation and integrates popular techniques of disambiguation and query expansion in a unified formal framework is proposed.
Linguistic Regularities in Continuous Space Word Representations
TLDR
The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, with each relationship characterized by a relation-specific vector offset (a small sketch of this offset-based analogy procedure follows the reference list below).
GloVe: Global Vectors for Word Representation
TLDR
A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Inducing Crosslingual Distributed Representations of Words
TLDR
This work induces distributed representations for a pair of languages jointly and shows that these representations are informative by using them for crosslingual document classification, where classifiers trained on these representations substantially outperform strong baselines when applied to a new language.
Evaluating Neural Word Representations in Tensor-Based Compositional Settings
TLDR
In the more constrained tasks, co-occurrence vectors are competitive, although the choice of compositional method is important; on the larger-scale tasks, they are outperformed by neural word embeddings, which show robust, stable performance across the tasks.
Learning word embeddings efficiently with noise-contrastive estimation
TLDR
This work proposes a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation, and achieves results comparable to the best ones reported, using four times less data and more than an order of magnitude less computing time.
Word Representations: A Simple and General Method for Semi-Supervised Learning
TLDR
This work evaluates Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking, and finds that each of the three word representations improves the accuracy of these baselines.
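
As noted above for "Linguistic Regularities in Continuous Space Word Representations", the relation-specific vector offset can be illustrated with a small sketch: an analogy query "a is to b as c is to ?" is answered by the vocabulary word closest (by cosine) to vec(b) - vec(a) + vec(c). The plain dictionary-of-arrays representation below is an assumption made for illustration.

import numpy as np

def analogy(a, b, c, vectors):
    # vectors: dict mapping words to numpy arrays.
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):                  # exclude the query words themselves
            continue
        sim = float(target @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word                           # e.g. analogy("man", "king", "woman") -> "queen"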