FlumeJava: easy, efficient data-parallel pipelines

@inproceedings{Chambers2010FlumeJavaEE,
  title={FlumeJava: easy, efficient data-parallel pipelines},
  author={Craig Chambers and Ashish Raniwala and Frances Perry and Stephen Adams and Robert R. Henry and Robert W. Bradshaw and Nathan Weizenbaum},
  booktitle={PLDI '10},
  year={2010}
}
MapReduce and similar systems significantly ease the task of writing data-parallel code. […] When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that…
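
The deferred-evaluation idea in the abstract can be illustrated with a toy sketch. This is not the FlumeJava API or its optimizer: the class `DeferredCollection`, its methods `of`, `parallelDo`, `pendingOps`, and `run`, and the single "fuse everything into one pass" step are assumptions made up for illustration. It only shows the general pattern of recording operations first and optimizing and executing them when the result is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical toy, NOT the FlumeJava API: a collection whose operations are deferred. */
final class DeferredCollection<T> {
    private final List<?> source;                      // already-materialized input
    private final List<Function<Object, Object>> ops;  // recorded, not yet executed

    private DeferredCollection(List<?> source, List<Function<Object, Object>> ops) {
        this.source = source;
        this.ops = ops;
    }

    static <T> DeferredCollection<T> of(List<T> input) {
        return new DeferredCollection<>(input, new ArrayList<>());
    }

    /** Records an element-wise operation in the plan; nothing is computed here. */
    @SuppressWarnings("unchecked")
    <R> DeferredCollection<R> parallelDo(Function<? super T, ? extends R> fn) {
        List<Function<Object, Object>> extended = new ArrayList<>(ops);
        extended.add(x -> fn.apply((T) x));
        return new DeferredCollection<>(source, extended);
    }

    /** Number of operations waiting in the plan. */
    int pendingOps() {
        return ops.size();
    }

    /** Stand-in for "optimize, then execute": fuse all recorded operations, then make one pass. */
    @SuppressWarnings("unchecked")
    List<T> run() {
        Function<Object, Object> fused = Function.identity();
        for (Function<Object, Object> op : ops) {
            fused = fused.andThen(op);
        }
        List<T> out = new ArrayList<>(source.size());
        for (Object x : source) {
            out.add((T) fused.apply(x));
        }
        return out;
    }
}

final class DeferredDemo {
    public static void main(String[] args) {
        DeferredCollection<String> words = DeferredCollection.of(Arrays.asList("a", "bb", "ccc"));
        DeferredCollection<Integer> lengths = words.parallelDo(String::length);
        DeferredCollection<Integer> scaled = lengths.parallelDo(n -> n * 10);
        System.out.println(scaled.pendingOps()); // 2 operations recorded, none executed yet
        System.out.println(scaled.run());        // [10, 20, 30]
    }
}
```

In this sketch the "underlying primitive" is a plain sequential loop; a real system would hand the fused plan to a parallel backend instead.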

Niijima: sound and automated computation consolidation for efficient multilingual data-parallel pipelines

TLDR
Niijima, an optimizing compiler for Microsoft's Scope/Cosmos, is presented, which can consolidate C#-based user-defined operators (UDOs) across SQL statements, thereby reducing the number of dataflow vertices that require the managed runtime, and thus the amount of C# computation and the data marshalling cost.

Blaze: Simplified High Performance Cluster Computing

TLDR
Blaze, a C++ library that makes it easy to develop high-performance parallel programs for compute-intensive tasks, is presented; it has three main improvements over conventional MapReduce implementations: eager reduction, fast serialization, and special treatment for a small fixed key range.

Safe Data Parallelism for General Streaming

TLDR
This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing, and shows linear scalability for parallel regions that are computation-bound, and near-linear scalability when tuples are shuffled across parallel regions.

Composable and efficient functional big data processing framework

TLDR
The Hierarchically Distributed Data Matrix is presented, which is a functional, strongly-typed data representation for writing composable big data applications, and a runtime framework is provided to support the execution of HDM applications on distributed infrastructures.

Auto-parallelizing stateful distributed streaming applications

TLDR
This paper presents a compiler and runtime system that automatically extract data parallelism for distributed stream processing, guaranteeing safety, even in the presence of stateful, selective, and user-defined operators.

HDM: A Composable Framework for Big Data Processing

TLDR
The Hierarchically Distributed Data Matrix is presented, which is a functional, strongly-typed data representation for writing composable big data applications, and a runtime framework is provided to support the execution, integration and management of HDM applications on distributed infrastructures.

Composable Incremental and Iterative Data-Parallel Computation with Naiad

TLDR
This paper evaluates a prototype of Naiad, a set of declarative data-parallel language extensions and an associated runtime supporting efficient and composable incremental and iterative computation, that uses shared memory on a single multi-core computer.

PQL: A Purely-Declarative Java Extension for Parallel Programming

TLDR
This work presents an approach where parallel programming takes place in a restricted (sub-Turing-complete), logic-based declarative language, embedded in Java, that can express the parallel elements of a computing task, while regular Java code captures sequential elements.

Yedalog: Exploring Knowledge at Scale

TLDR
Yedalog is introduced, a declarative programming language that allows programmers to mix data-parallel pipelines and computation seamlessly in a single language, and extends Datalog, incorporating not only computational features from logic programming, but also features for working with data structured as nested records.

Steno: automatic optimization of declarative queries

TLDR
Steno is developed, which uses a combination of novel and well-known techniques to generate code for declarative queries that is almost as efficient as hand-optimized code.
...

References

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

TLDR
It is shown that excellent absolute performance can be attained--a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster--as well as demonstrating near-linear scaling of execution time on representative applications as the authors vary the number of computers used for a job.

Keynote talk: Experiences with MapReduce, an abstraction for large-scale computation

  • J. Dean
  • Computer Science
  • 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT)
  • 2006
TLDR
The basic programming model of MapReduce is described, the experience of using it in a variety of domains is discussed, and the implications of programming models like MapReduce as one paradigm to simplify development of parallel software for multi-core microprocessors are considered.

SCOPE: easy and efficient parallel processing of massive data sets

TLDR
A new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis, designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters.

Map-reduce-merge: simplified relational data processing on large clusters

TLDR
A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.

MapReduce: simplified data processing on large clusters

TLDR
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Pig latin: a not-so-foreign language for data processing

TLDR
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce; its implementation, Pig, is an open-source, Apache-incubator project available for general use.

Interpreting the data: Parallel analysis with Sawzall

TLDR
The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.

New Ideas in Parallel Lisp: Language Design, Implementation, and Programming Tools

  • R. Halstead
  • Computer Science
  • Workshop on Parallel Lisp
  • 1989
TLDR
Using new, elegant ideas in the areas of speculative computation, continuations, exception handling, aggregate data structures, and scheduling, it should be possible to build “second generation” parallel Lisp systems that are as powerful and elegantly structured as sequential Lisp systems.

C**: A Large-Grain, Object-Oriented, Data-Parallel Programming Language

C** is a new data-parallel programming language based on a new computation model called large-grain data parallelism. C** overcomes many disadvantages of existing data-parallel languages, yet retains…

Bigtable: A Distributed Storage System for Structured Data

TLDR
The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.