Learn More
We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined(More)
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query(More)
Data flows are a popular abstraction to define dataintensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates do not alone provide all the(More)
We introduce a formal framework for the refactorization of complete information systems, i. e. the data model and the data itself. Within this framework model transformations are uniquely extended to the data level and result in data migrations that protects the information contained in the data. The framework is described using general and abstract notions(More)
  • 1