Johann-Christoph Freytag

Learn More
The Web of Linked Data forms a single, globally distributed dataspace. Due to the openness of this dataspace, it is not possible to know in advance all data sources that might be relevant for query answering. This openness poses a new challenge that is not addressed by traditional research on federated query processing. In this paper we present an approach(More)
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined(More)
Today's DBMSs are unable to support the increasing demands of the various applications that would like to use a DBMS. Each kind of application poses new requirements for the DBMS. The Starburst project at IBM's Almaden Research Center aims to extend relational DBMS technology to bridge this gap between applications and the DBMS. While providing a full(More)
A private information retrieval (PIR) protocol allows a user to retrieve one of N records from a database while hiding the identity of the record from the database server. With the initially proposed PIR protocols to process a query, the server has to process the entire database, resulting in an unacceptable response time for large databases. Later(More)
Cleansing data from impurities is an integral part of data processing and maintenance. This has lead to the development of a broad range of methods intending to enhance the accuracy and thereby the usability of existing data. This paper presents a survey of data cleansing problems, approaches, and methods. We classify the various types of anomalies(More)
Integrated access to information that is spread over multiple, distributed, and heterogeneous sources is an important problem in many scienti c and commercial domains. While much work has been done on query processing and choosing plans under cost criteria, very little is known about the important problem of incorporating the information quality aspect into(More)
For many information domains there are numerous World Wide Web data sources. The sources vary both in their extension and their intension: They represent different real world entities with possible overlap and provide different attributes of these entities. Mediator-based information systems allow integrated access to such sources by providing a common(More)
Complex queries often contain common or similar subexpressions, either within a single query or among multiple queries submitted as a batch. If so, query execution time can be improved by evaluating a common subexpression once and reusing the result in multiple places. However, current query optimizers do not recognize and exploit similar subexpressions,(More)
Data flows are a popular abstraction to define dataintensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates do not alone provide all the(More)