Motivated by the ongoing success of Linked Data and the growing amount of semantic data sources available on the Web, new challenges to query processing are emerging. Especially in distributed settings that require joining data provided by multiple sources, sophisticated optimization techniques are necessary for efficient query processing. We propose novel(More)
Online communities have become popular for publishing and searching content, as well as for finding and connecting to other users. User-generated content includes, for example, personal blogs, bookmarks, and digital photos. These items can be annotated and rated by different users, and these social tags and derived user-specific scores can be leveraged for(More)
The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. This annotation process exploits categorical information in Wikipedia, which is a high-quality,(More)
This paper presents a novel engine, coined TopX, for efficient ranked retrieval of XML documents over semistructured but non-schematic data collections. The algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accesses. The(More)
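The threshold-algorithm paradigm mentioned in the TopX abstract can be illustrated with a minimal sketch of the generic threshold algorithm for top-k aggregation. This is not the TopX engine itself: the function name `threshold_topk`, the summation scoring, and the dictionary-based random access are assumptions made for illustration only.

```python
import heapq


def threshold_topk(index_lists, k):
    """Generic threshold algorithm (illustrative sketch, not TopX).

    index_lists: one list per query term, holding (doc_id, score) pairs
    sorted by score in descending order. A document's total score is the
    sum of its per-list scores; a missing entry counts as 0.
    Returns up to k (total_score, doc_id) pairs, best first.
    """
    # Hypothetical stand-in for random accesses into each index list.
    lookup = [dict(lst) for lst in index_lists]
    seen = set()
    top = []  # min-heap of (total_score, doc_id)
    max_depth = max((len(lst) for lst in index_lists), default=0)

    for depth in range(max_depth):
        threshold = 0  # best total any still-unseen document could reach
        for lst in index_lists:
            if depth >= len(lst):
                continue  # an exhausted list contributes nothing
            doc, score = lst[depth]  # sequential (sorted) access
            threshold += score
            if doc not in seen:
                seen.add(doc)
                total = sum(t.get(doc, 0) for t in lookup)  # random accesses
                if len(top) < k:
                    heapq.heappush(top, (total, doc))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, doc))
        # Stop once the k-th best result is at least as good as the threshold.
        if len(top) == k and top[0][0] >= threshold:
            break

    return sorted(top, key=lambda pair: (-pair[0], pair[1]))


# Two toy index lists with integer scores (made-up data).
list_a = [("d1", 9), ("d2", 8), ("d3", 1)]
list_b = [("d2", 7), ("d3", 6), ("d1", 2)]
result = threshold_topk([list_a, list_b], k=2)
```

The sketch performs a random access for every newly seen document, which is the plain form of the algorithm; the abstract describes TopX as scheduling only a few random accesses, and that scheduling is not modeled here.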
This paper investigates how to automatically classify schema-less XML data into a user-defined topic directory. The main focus is on constructing appropriate feature spaces on which a classifier operates. In addition to the usual text-based term frequency vectors, we study XML twigs and tag paths as extended features that can be combined with text term(More)
In this paper we present HOPI, a new connection index for XML documents based on the concept of the 2-hop cover of a directed graph introduced by Cohen et al. In contrast to most of the prior work on XML indexing we consider not only paths with child or parent relationships between the nodes, but also provide space- and time-efficient reachability tests(More)
The HOPI index, a connection index for XML documents based on the concept of a 2-hop cover, provides space- and time-efficient reachability tests along the ancestor, descendant, and link axes to support path expressions with wildcards in XML search engines. This paper presents enhanced algorithms for building HOPI, shows how to augment the index with(More)
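To make the 2-hop cover idea concrete: each node u stores two label sets, L_out(u) and L_in(u), and u reaches v exactly when L_out(u) and L_in(v) share a node. The sketch below builds such labels with a simple pruned breadth-first labeling, which is a swapped-in construction chosen for brevity and not the HOPI build algorithm. All function names and the toy graph are assumptions.

```python
from collections import deque


def build_two_hop_labels(nodes, edges):
    """Build 2-hop reachability labels with pruned BFS (illustrative only)."""
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    l_in = {n: set() for n in nodes}
    l_out = {n: set() for n in nodes}

    def covered(u, v):
        # True if the labels built so far already prove that u reaches v.
        return not l_out[u].isdisjoint(l_in[v])

    for center in nodes:
        # Forward pass adds center to l_in of its descendants,
        # backward pass adds center to l_out of its ancestors.
        for adjacency, labels, forward in ((succ, l_in, True), (pred, l_out, False)):
            queue = deque([center])
            visited = {center}
            while queue:
                x = queue.popleft()
                known = covered(center, x) if forward else covered(x, center)
                if known:
                    continue  # prune: this connection is already covered
                labels[x].add(center)
                for y in adjacency[x]:
                    if y not in visited:
                        visited.add(y)
                        queue.append(y)
    return l_in, l_out


def reachable(l_in, l_out, u, v):
    """Reachability test: one set intersection, no graph traversal."""
    return not l_out[u].isdisjoint(l_in[v])


# Toy element graph (made-up data): a -> b -> c and a -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "d")]
l_in, l_out = build_two_hop_labels(nodes, edges)
```

With these labels, `reachable(l_in, l_out, "a", "c")` is answered from the two label sets alone, which is what makes ancestor, descendant, and link-axis tests cheap at query time. The order in which centers are processed affects label size, and this sketch simply uses the input order.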
Online communities have recently become a popular tool for publishing and searching content, as well as for finding and connecting to other users that share common interests. The content is typically user-generated and includes, for example, personal blogs, bookmarks, and digital photos. A particularly intriguing type of content is user-generated(More)
In addition to purely occurrence-based relevance models, term proximity has been frequently used to enhance retrieval quality of keyword-oriented retrieval systems. While there have been approaches on effective scoring functions that incorporate proximity, there has not been much work on algorithms or access methods for their efficient evaluation. This(More)
We consider the problem of processing K-Nearest Neighbor (KNN) queries over large datasets where the index is jointly maintained by a set of machines in a computing cluster. The proposed RankReduce approach uses locality sensitive hashing (LSH) together with a MapReduce implementation, which by design is a perfect match as the hashing principle of LSH can(More)
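The pairing of LSH with a map and a reduce step can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the RankReduce implementation: it uses a random-hyperplane (sign) hash, which may differ from the hash family used in the paper, and `make_signature`, `map_phase`, `reduce_phase`, and `knn` are hypothetical names.

```python
import random
from collections import defaultdict


def make_signature(dim, bits, seed=0):
    """Return a hash function mapping a vector to a tuple of sign bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

    def signature(vec):
        return tuple(
            int(sum(p * x for p, x in zip(plane, vec)) >= 0) for plane in planes
        )

    return signature


def map_phase(points, signature):
    """Map step: emit (bucket key, (point id, vector)) for every point."""
    for pid, vec in points.items():
        yield signature(vec), (pid, vec)


def reduce_phase(pairs):
    """Reduce step: group all points that share a bucket key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets


def knn(query, buckets, signature, k):
    """Approximate KNN: rank only the candidates in the query's bucket."""
    candidates = buckets.get(signature(query), [])

    def squared_distance(vec):
        return sum((a - b) ** 2 for a, b in zip(query, vec))

    ranked = sorted(candidates, key=lambda c: squared_distance(c[1]))
    return [pid for pid, _ in ranked[:k]]


# Made-up data: p3 points in the opposite direction of p1.
points = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [-1.0, 0.0]}
signature = make_signature(dim=2, bits=4)
buckets = reduce_phase(map_phase(points, signature))
nearest = knn(points["p1"], buckets, signature, k=1)
```

Because the bucket key plays the role of the shuffle key, points that hash alike end up at the same reducer, and a query only needs to be compared against one bucket. A single hash table can miss true neighbors; practical LSH setups typically use several tables, which this sketch omits.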