This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of ...
MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines. Most machine learning and data mining applications involve iterative computations over large datasets, such as Web hyperlink structures and social network graphs. Yet, the MapReduce model ...
This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access ...
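The decomposition into multiplication and summation stages mentioned in this abstract can be illustrated with a minimal in-memory sketch. It is not the paper's implementation and does not use Hadoop: the function names (`build_postings`, `map_products`, `reduce_sums`, `pairwise_similarity`) and the toy term weights are hypothetical, and similarity is assumed here to be a plain inner product of term-weight vectors.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing step: map each term to its list of (doc_id, weight) postings."""
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def map_products(postings):
    """Multiplication stage: for every term, emit one partial product
    for each pair of documents that share that term."""
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(partial_products):
    """Summation stage: add up the partial products for each document pair."""
    sims = defaultdict(float)
    for pair, product in partial_products:
        sims[pair] += product
    return dict(sims)


def pairwise_similarity(docs):
    """Chain the stages; pairs with no shared term never appear in the output."""
    return reduce_sums(map_products(build_postings(docs)))


# Toy collection (hypothetical weights): doc_id -> {term: weight}
docs = {
    "d1": {"a": 2, "b": 1},
    "d2": {"a": 1, "c": 3},
    "d3": {"b": 4, "c": 1},
}
similarities = pairwise_similarity(docs)
# similarities == {("d1", "d2"): 2.0, ("d1", "d3"): 4.0, ("d2", "d3"): 3.0}
```

In this sketch each term's postings list is processed independently, which is what would let the multiplication stage run in parallel across terms in a real MapReduce job; the in-memory dictionaries stand in for the shuffle between the two stages.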
Automatic knowledge base population from text is an important technology for a broad range of approaches to learning by reading. Effective automated knowledge base population depends critically upon coreference resolution of entities across sources. Use of a wide range of features, both those that capture evidence for entity merging and those that argue ...
This paper describes a computational approach to resolving the true referent of a named mention of a person in the body of an email. A generative model of mention generation is used to guide mention resolution. Results on three relatively small collections indicate that the accuracy of this approach compares favorably to the best known techniques, and ...
Test collections are the primary drivers of progress in information retrieval. They provide yardsticks for assessing the effectiveness of ranking functions in an automatic, rapid, and repeatable fashion and serve as training data for learning-to-rank models. However, manual construction of test collections tends to be slow, labor-intensive, and expensive. ...
Social media platforms are a major source of information for both the general public and journalists. Journalists use Twitter and other social media services to gather story ideas, to find eyewitnesses, and for a wide range of other purposes. One way in which journalists use Twitter is to ask questions. This paper reports on an empirical investigation ...
This paper describes Ivory, an attempt to build a distributed retrieval system around the open-source Hadoop implementation of MapReduce. We focus on three noteworthy aspects of our work: a retrieval architecture built directly on the Hadoop Distributed File System (HDFS), a scalable MapReduce algorithm for inverted indexing, and webpage classification to ...
Access to historically significant email archives poses challenges that arise less often in personal collections. Most notably, searchers may need help making sense of the identities, roles, and relationships of individuals who participated in archived email exchanges. This paper describes an exploratory study of identity resolution in the public subset of ...
In TREC 2006, teams from the University of Maryland participated in the Blog track, the Expert Search task of the Enterprise track, the Complex Interactive Question Answering task of the Question Answering track, and the Legal track. This paper reports our results. 1 Blog Track: Blogs are being hailed as fundamentally different from other Internet ...