Learn More
Online communities have become popular for publishing and searching content, as well as for finding and connecting to other users. User-generated content includes, for example, personal blogs, bookmarks, and digital photos. These items can be annotated and rated by different users, and these social tags and derived user-specific scores can be leveraged for(More)
Motivated by the ongoing success of Linked Data and the growing amount of semantic data sources available on the Web, new challenges to query processing are emerging. Especially in distributed settings that require joining data provided by multiple sources, sophisticated optimization techniques are necessary for efficient query processing. We propose novel(More)
The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. This annotation process exploits categorical information in Wiki-pedia, which is a high-quality,(More)
This paper investigates how to automatically classify schema-less XML data into a user-defined topic directory. The main focus is on constructing appropriate feature spaces on which a classifier operates. In addition to the usual text-based term frequency vectors, we study XML twigs and tag paths as extended features that can be combined with text term(More)
Top-k queries based on ranking elements of multidimensional datasets are a fundamental building block for many kinds of information discovery. The best known general-purpose algorithm for evaluating top-k queries is Fagin's threshold algorithm (TA). Since the user's goal behind top-k queries is to identify one or a few relevant and novel data items, it is(More)
This paper presents a novel engine, coined TopX, for efficient ranked retrieval of XML documents over semistructured but non-schematic data collections. The algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accesses. The(More)
— Online communities have recently become a popular tool for publishing and searching content, as well as for finding and connecting to other users that share common interests. The content is typically user-generated and includes, for example, personal blogs, bookmarks, and digital photos. A particularly intriguing type of content is user-generated(More)
In addition to purely occurrence-based relevance models, term proximity has been frequently used to enhance retrieval quality of keyword-oriented retrieval systems. While there have been approaches on effective scoring functions that incorporate proximity, there has not been much work on algorithms or access methods for their efficient evaluation. This(More)
Top-<i>k</i> query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-<i>k</i> queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation(More)
The HOPI index, a connection index for XML documents based on the concept of a 2–hop cover, provides space– and time–efficient reachability tests along the ancestor, descendant , and link axes to support path expressions with wild-cards in XML search engines. This paper presents enhanced algorithms for building HOPI, shows how to augment the index with(More)