Learn More
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined(More)
Recently, quite a few query and scripting languages for Map-Reduce-based systems have been developed to ease formulating complex data analysis tasks. However, existing tools mainly provide basic operators for rather simple analyses, such as aggregating or filtering. Analytic functionality for advanced applications, such as data cleansing or information(More)
Many government organizations publish a variety of data on the web to enable transparency, foster applications, and to satisfy legal obligations. Data content, format, structure, and quality vary widely, even in cases where the data is published using the wide-spread linked data principles. Yet within this data and their integration lies much value: We(More)
Duplicate detection is the process of identifying multiple representations of same real world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase(More)
Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These data-flows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (Udfs). However, the heavy use of Udfs is(More)
The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management , such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and(More)
Governments are increasingly publishing their data to enable organizations and citizens to browse and analyze the data. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the(More)
This paper introduces the SOM (Simple Object Machine) family of virtual machine (VM) implementations, a collection of VMs for the same Smalltalk dialect addressing students at different levels of expertise. Starting from a Java-based implementation, several ports of the VM to different programming languages have been developed and put to successful use in(More)
Analyzing big data sets as they occur in modern business and science applications requires query languages that allow for the specification of complex data processing tasks. Moreover, these ideally declarative query specifications have to be optimized, parallelized and scheduled for processing on massively parallel data processing platforms. This paper(More)
Duplicates in a database are one of the prime causes of poor data quality and are at the same time among the most difficult data quality problems to alleviate. To detect and remove such duplicates, many commercial and academic products and methods have been developed. The evaluation of such systems is usually in need of pre-classified results. Such gold(More)