Many government organizations publish a variety of data on the web to enable transparency, foster applications, and satisfy legal obligations. Data content, format, structure, and quality vary widely, even in cases where the data is published using the widespread linked data principles. Yet within this data and its integration lies much value: We…
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined…
Recently, quite a few query and scripting languages for Map-Reduce-based systems have been developed to ease formulating complex data analysis tasks. However, existing tools mainly provide basic operators for rather simple analyses, such as aggregating or filtering. Analytic functionality for advanced applications, such as data cleansing or information…
The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and…
This paper introduces the SOM (Simple Object Machine) family of virtual machine (VM) implementations, a collection of VMs for the same Smalltalk dialect addressing students at different levels of expertise. Starting from a Java-based implementation, several ports of the VM to different programming languages have been developed and put to successful use in…
Analyzing big data sets as they occur in modern business and science applications requires query languages that allow for the specification of complex data processing tasks. Moreover, these ideally declarative query specifications have to be optimized, parallelized and scheduled for processing on massively parallel data processing platforms. This paper…
Governments are increasingly publishing their data to enable organizations and citizens to browse and analyze the data. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the…
Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is…
Duplicates in a dataset are multiple representations of the same real-world entity and constitute a major data quality problem. This paper investigates the problem of estimating the number and sizes of duplicate record clusters in advance and describes a sampling-based method for solving this problem. In extensive experiments, on multiple datasets,…