An event-centric model for multilingual document similarity

Abstract

Document similarity measures play an important role in many document retrieval and exploration tasks. Over the past decades, several models and techniques have been developed to determine a ranked list of documents similar to a given query document. Interestingly, the proposed approaches typically rely on extensions to the vector space model and are rarely suited for multilingual corpora. In this paper, we present a novel document similarity measure that is based on events extracted from documents. An event is described solely by nearby occurrences of temporal and geographic expressions in a document's text. Thus, a document is modeled as a set of events that can be compared and ranked using temporal and geographic hierarchies. A key feature of our model is that it is term- and language-independent, because temporal and geographic expressions mentioned in texts are normalized to a standard format. This also makes it possible to determine similar documents across languages, an important feature in the context of document exploration. Our approach proves to be quite effective, including in the discovery of new similarities, as our experiments on different (multilingual) corpora demonstrate.
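To make the idea of comparing documents as sets of events more concrete, the following is a minimal illustrative sketch and not the similarity measure defined in the paper. It assumes that each event has already been normalized to a date path (year, month, day) and a place path (continent, country, city), and it scores two events by how many hierarchy levels they share from the root down. The names `Event`, `prefix_sim`, `event_sim`, and `doc_sim`, as well as the product and best-match averaging used to combine scores, are hypothetical choices made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Both fields are normalized, language-independent paths ordered
    # from coarse to fine (an assumption made for this sketch).
    date: tuple   # e.g. ("2011", "07", "24")
    place: tuple  # e.g. ("Europe", "Germany", "Heidelberg")


def prefix_sim(a, b):
    """Fraction of hierarchy levels shared from the root down."""
    if not a or not b:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def event_sim(e1, e2):
    """Combine temporal and geographic overlap (product is an assumption)."""
    return prefix_sim(e1.date, e2.date) * prefix_sim(e1.place, e2.place)


def doc_sim(doc_a, doc_b):
    """Symmetric average of each event's best match in the other document."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src, dst):
        return sum(max(event_sim(e, f) for f in dst) for e in src) / len(src)

    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2


# Two documents in different languages reduce to comparable event sets,
# because no terms from the original text are involved.
english_doc = {Event(("2011", "07", "24"), ("Europe", "Germany", "Heidelberg"))}
german_doc = {Event(("2011", "07"), ("Europe", "Germany", "Heidelberg"))}
unrelated_doc = {Event(("1999", "01", "01"), ("Asia", "Japan", "Tokyo"))}

print(doc_sim(english_doc, german_doc))
print(doc_sim(english_doc, unrelated_doc))
```

In this sketch, a coarser mention (a month instead of a day) still matches partially through the hierarchy, which is one simple way to read the abstract's statement that events are compared using temporal and geographic hierarchies.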

DOI: 10.1145/2009916.2010043

Cite this paper

@inproceedings{Strtgen2011AnEM, title={An event-centric model for multilingual document similarity}, author={Jannik Str{\"o}tgen and Michael Gertz and Conny Junghans}, booktitle={SIGIR}, year={2011} }