We present Stratosphere, an open-source software stack for parallel data analysis. Stratosphere brings together a unique set of features that allow the expressive, easy, and efficient programming of analytical applications at very large scale. Stratosphere’s features include “in situ” data processing, a declarative query language, treatment of user-defined…
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query…
Recently, quite a few query and scripting languages for Map-Reduce-based systems have been developed to ease formulating complex data analysis tasks. However, existing tools mainly provide basic operators for rather simple analyses, such as aggregating or filtering. Analytic functionality for advanced applications, such as data cleansing or information…
Recent years have seen an increased interest in large-scale analytical dataflows on non-relational data. These dataflows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such dataflows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is…
String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is…
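To make the problem statement above concrete, here is a minimal sketch of string similarity search as a plain linear scan. The snippet is cut off before it names a similarity measure, so the use of edit (Levenshtein) distance with a threshold is an assumption made for illustration, and the names `edit_distance`, `similarity_search`, and `max_distance` are hypothetical. It shows only what a correct answer looks like; it is not the efficient method the abstract refers to.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity_search(strings: Iterable[str], query: str,
                      max_distance: int) -> List[str]:
    """Return every string within max_distance edits of query.

    Assumption: "similar" means edit distance <= max_distance.
    """
    results = []
    for candidate in strings:
        # Length filter: a length gap larger than the threshold rules
        # out a match without running the dynamic program.
        if abs(len(candidate) - len(query)) > max_distance:
            continue
        if edit_distance(candidate, query) <= max_distance:
            results.append(candidate)
    return results


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "TTTT", "ACG"]
    print(similarity_search(reads, "ACGT", max_distance=1))
```

The scan costs time proportional to the size of the string set for every query, which is why very large sets call for index structures or filtering beyond the simple length check used here.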
We present an approach for extracting molecular events from literature based on a deep parser, using a query language for parse trees. Detected events range from gene expression to protein localization, and cover a multitude of different entity types, including genes/proteins, binding sites, and locations. Furthermore, our approach is capable of…
Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. DNA sequencing in particular produces large collections of erroneous strings which need to be searched, compared, and merged. However, current RDBMS offer similarity operations…
Since the early days of the Human Genome Project, data management has been recognized as a key challenge for modern molecular biology research. By the end of the nineties, technologies had been established that adequately supported most ongoing projects, typically built upon relational database management systems. However, recent years have seen a dramatic…
Analyzing big data sets as they occur in modern business and science applications requires query languages that allow for the specification of complex data processing tasks. Moreover, these ideally declarative query specifications have to be optimized, parallelized and scheduled for processing on massively parallel data processing platforms. This paper…
Similarity-based queries play an important role in many large-scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. One strategy to speed up similarity-based queries is parallelization on clusters using MapReduce. However, distributing data over a cluster also incurs high cost. At…