Learn More
Many applications in Multilingual and Multi-modal Information Access involve searching large databases of high dimensional data objects with multiple (conditionally independent) views. In this work we consider the problem of learning hash functions for similarity search across the views for such applications. We propose a principled method for learning a(More)
The development of broad domain statistical machine translation systems is gated by the availability of parallel data. A promising strategy for mitigating data scarcity is to mine parallel data from comparable corpora. Although comparable corpora seldom contain parallel sentences, they often contain parallel words or phrases. Recent fragment extraction(More)
It is well known that the use of a good Machine Transliteration system improves the retrieval performance of Cross-Language Information Retrieval (CLIR) systems when the query and document languages have different orthography and phonetic alphabets. However, the effectiveness of a Machine Transliteration system in CLIR is limited by its ability to produce(More)
Topic models have great potential for helping users understand document corpora. This potential is stymied by their purely un-supervised nature, which often leads to topics that are neither entirely meaningful nor effective in extrinsic tasks (Chang et al., 2009). We propose a simple and effective way to guide topic models to learn topics of specific(More)
In this paper, we address the problem of mining transliterations of Named Entities (NEs) from large comparable corpora. We leverage the empirical fact that multilingual news articles with similar news content are rich in Named Entity Transliteration Equivalents (NETEs). Our mining algorithm, MINT, uses a cross-language document similarity model to align(More)
Although Wikipedia has emerged as a powerful collaborative Encyclopedia on the Web, it is only partially multilingual as most of the content is in English and a small number of other languages. In real-life scenarios, non-English users in general and ESL/EFL 1 users in particular, have a need to search for relevant English Wikipedia articles as no relevant(More)
In this paper we study a set of problems that are of considerable importance to Statistical Machine Translation (SMT) but which have not been addressed satisfactorily by the SMT research community. Over the last decade, a variety of SMT algorithms have been built and empirically tested whereas little is known about the computational complexity of some of(More)
Pseudo-Relevance Feedback (PRF) assumes that the top-ranking n documents of the initial retrieval are relevant and extracts expansion terms from them. In this work, we introduce the notion of pseudo-irrelevant documents, i.e. high-scoring documents outside of top n that are highly unlikely to be relevant. We show how pseudo-irrelevant documents can be used(More)