Building a High-Level Dataflow System on top of MapReduce: The Pig Experience

  title={Building a High-Level Dataflow System on top of MapReduce: The Pig Experience},
  author={Alan Gates and Olga Natkovich and Shubham Chopra and Pradeep Kamath and Shravan Narayanam and Christopher Olston and Benjamin C. Reed and Santhosh Srinivasan and Utkarsh Srivastava},
  journal={Proc. VLDB Endow.},
Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise… 
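The abstract contrasts SQL's declarative style with Map-Reduce's explicit dataflow programming model. As a point of reference (not taken from the paper), here is a minimal Python sketch of the Map-Reduce model using the canonical word-count example; the helper names `map_phase` and `reduce_phase` are invented for illustration:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user's map function to each record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def reduce_phase(pairs, reduce_fn):
    """Group values by key, then apply the user's reduce function per group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical Map-Reduce example.
lines = ["pig runs on hadoop", "hadoop runs map reduce"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(pairs, lambda key, values: sum(values))
# counts["runs"] == 2, counts["hadoop"] == 2, counts["pig"] == 1
```

A real multi-step, branching dataflow requires chaining several such jobs by hand, which is the "low-level hacking" the abstract refers to and which Pig is designed to eliminate.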


Compile-Time Query Optimization for Big Data Analytics
A new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time are introduced.
m2r2: A Framework for Results Materialization and Reuse in High-Level Dataflow Systems for Big Data
High-level parallel dataflow systems, such as Pig and Hive, have lately gained great popularity in the area of big data processing. These systems often consist of a declarative query language and a…
YSmart: Yet Another SQL-to-MapReduce Translator
YSmart, a correlation-aware SQL-to-MapReduce translator that applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query, can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators.
Representing MapReduce Optimisations in the Nested Relational Calculus
This paper argues that the Nested Relational Calculus provides a general, elegant and effective way to describe and investigate these optimizations and demonstrates that MapReduce programs can be expressed and represented straightforwardly in NRC by adding syntactic short-hands.
Compile-Time Code Generation for Embedded Data-Intensive Query Languages
A new query language for data-intensive scalable computing, called DIQL, that is deeply embedded in Scala, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time are introduced.
Clydesdale: structured data processing on MapReduce
Clydesdale, a novel system for structured data processing on Hadoop -- a popular implementation of MapReduce, is introduced and it is shown that Clydesdale provides more than an order of magnitude in performance improvements compared to existing approaches without requiring any changes to the underlying platform.
Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce
This paper describes a data warehouse system, called Cheetah, built on top of MapReduce, designed specifically for the authors' online advertising application to allow various simplifications and custom optimizations and describes a stack of optimization techniques ranging from data compression and access method to multi-query optimization and exploiting materialized views.
Studying the effect of multi-query functionality on a correlation-aware SQL-to-mapreduce translator
This project ventures into bridging the gap between Hadoop and relational databases by adding multi-query functionality to a SQL-to-MapReduce translator, and suggests that the modified translator scales linearly as the data size increases.
The family of mapreduce and large-scale data processing systems
This article provides a comprehensive survey for a family of approaches and mechanisms of large-scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities.
Accumulative Computation on MapReduce
This paper proposes a new approach of using the programming pattern accumulate over MapReduce, to handle a large class of problems that cannot be simply divided into independent sub-computations.


Pig latin: a not-so-foreign language for data processing
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
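To illustrate the "sweet spot" this summary describes, a hypothetical Pig Latin script (shown in the comments) can be mirrored step by step in plain Python; the data set and field names below are invented for illustration, not taken from the paper:

```python
# A hypothetical Pig Latin script this sketch mirrors:
#   raw     = LOAD 'visits' AS (user, url, year);
#   recent  = FILTER raw BY year >= 2008;
#   grouped = GROUP recent BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(recent);
from itertools import groupby
from operator import itemgetter

raw = [
    ("alice", "a.com", 2007),
    ("alice", "b.com", 2009),
    ("bob",   "a.com", 2009),
    ("bob",   "c.com", 2010),
]

recent = [t for t in raw if t[2] >= 2008]            # FILTER
recent.sort(key=itemgetter(0))                       # groupby needs sorted input
grouped = groupby(recent, key=itemgetter(0))         # GROUP BY user
counts = {user: sum(1 for _ in visits)               # FOREACH ... COUNT
          for user, visits in grouped}
# counts == {"alice": 1, "bob": 2}
```

Each Pig Latin statement names one dataflow step, which is the procedural flavor the summary contrasts with a single monolithic SQL query.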
Generating example data for dataflow programs
This work introduces and study the problem of generating example intermediate data for dataflow programs, in a manner that illustrates the semantics of the operators while keeping the example data small, and offers techniques for dealing with these obstacles.
SCOPE: easy and efficient parallel processing of massive data sets
A new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis, designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters.
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
It is shown that excellent absolute performance can be attained--a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster--as well as demonstrating near-linear scaling of execution time on representative applications as the authors vary the number of computers used for a job.
Java support for data-intensive systems: experiences building the telegraph dataflow system
This paper highlights the pleasures of coding with Java, and some of the pains of coding around Java in order to obtain good performance in a data-intensive server, and presents concrete suggestions for evolving Java's interfaces to better suit serious software systems development.
Compiled Query Execution Engine using JVM
Both an interpreted and a compiled query execution engine are developed in a relational, Java-based, in-memory database prototype, and experimental results show that, despite both engines benefiting from JIT, the compiled engine runs on average about twice as fast as the interpreted one, and significantly faster than an in-memory database prototype.
Interpreting the data: Parallel analysis with Sawzall
The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
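Sawzall's two-phase design, a per-machine filtering script feeding commutative, associative aggregators that are then merged centrally, can be sketched in Python; the record fields and table shapes below are assumptions for illustration, not taken from the paper:

```python
from collections import Counter

def local_phase(records):
    """Phase 1 (runs independently on each machine): a Sawzall-like
    script filters each record and emits values into an aggregator table."""
    table = Counter()
    for rec in records:
        if rec["status"] == 200:     # filter: keep successful requests
            table[rec["path"]] += 1  # emit into a 'sum' table indexed by path
    return table

def aggregate_phase(tables):
    """Phase 2: merge per-machine tables; the merge is commutative and
    associative, so it can happen in any order and in parallel."""
    total = Counter()
    for t in tables:
        total.update(t)
    return total

machine1 = [{"path": "/a", "status": 200}, {"path": "/b", "status": 404}]
machine2 = [{"path": "/a", "status": 200}, {"path": "/c", "status": 200}]
merged = aggregate_phase([local_phase(machine1), local_phase(machine2)])
# merged["/a"] == 2; "/b" was filtered out in phase 1
```

Restricting phase 2 to commutative/associative merges is what lets the system distribute phase 1 freely across machines.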
Volcano - An Extensible and Parallel Query Evaluation System (G. Graefe, IEEE Trans. Knowl. Data Eng., 1994)
Volcano is the first implemented query execution engine that effectively combines extensibility and parallelism, and is extensible with new operators, algorithms, data types, and type-specific methods.
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
This paper explains the cube and roll-up operators, shows how they fit in SQL, explains how users can define new aggregate functions for cubes, and discusses efficient techniques to compute the cube.
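The CUBE operator described here generalizes GROUP BY to aggregate over every subset of the grouping columns, with an 'ALL' value standing in for each omitted dimension. A minimal Python sketch under that definition; the `cube` helper and the sample sales rows are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, n_dims, measure_idx):
    """Sum the measure column over every subset of the first n_dims
    columns, as SQL's CUBE does; omitted dimensions appear as 'ALL'."""
    totals = defaultdict(int)
    dims = range(n_dims)
    for size in range(n_dims + 1):
        for kept in combinations(dims, size):
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                totals[key] += row[measure_idx]
    return dict(totals)

# sales rows: (model, color, units sold)
rows = [("chevy", "red", 5), ("chevy", "blue", 3), ("ford", "red", 4)]
result = cube(rows, n_dims=2, measure_idx=2)
# result[("chevy", "ALL")] == 8 and result[("ALL", "ALL")] == 12
```

Note the output has 2^n_dims groupings (here: by model and color, by model only, by color only, and the grand total), which is exactly the cross-tab-plus-sub-totals pattern the paper's title describes.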
Jaql: A JSON query language