FlumeJava: easy, efficient data-parallel pipelines

@inproceedings{Chambers2010FlumeJavaEE,
  title={FlumeJava: easy, efficient data-parallel pipelines},
  author={C. Chambers and Ashish Raniwala and Frances Perry and Stephen Adams and R. Henry and Robert W. Bradshaw and Nathan Weizenbaum},
  booktitle={PLDI '10},
  year={2010}
}
MapReduce and similar systems significantly ease the task of writing data-parallel code. [...] When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that…
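
The deferred-evaluation idea described in the abstract can be illustrated with a small sketch. This is not FlumeJava's API (FlumeJava is a Java library): the class `DeferredCollection`, its `parallel_do` and `run` methods, and the `fuse` helper are hypothetical names chosen for illustration, and the only "optimization" shown is fusing a chain of element-wise functions into a single pass over a local list.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


class DeferredCollection:
    """Hypothetical stand-in for a deferred parallel collection.

    Each instance records how it would be computed instead of computing it.
    """

    def __init__(self,
                 source: Optional[Iterable[Any]] = None,
                 parent: Optional["DeferredCollection"] = None,
                 fn: Optional[Callable[[Any], Any]] = None) -> None:
        self.source = source
        self.parent = parent
        self.fn = fn

    def parallel_do(self, fn: Callable[[Any], Any]) -> "DeferredCollection":
        # Only record the operation in the plan; nothing is executed here.
        return DeferredCollection(parent=self, fn=fn)

    def plan(self) -> Tuple[Iterable[Any], List[Callable[[Any], Any]]]:
        # Walk back to the source, collecting the recorded functions in order.
        fns: List[Callable[[Any], Any]] = []
        node = self
        while node.parent is not None:
            fns.append(node.fn)
            node = node.parent
        fns.reverse()
        return node.source, fns

    def run(self) -> List[Any]:
        # "Optimize" the plan (fuse the chain), then execute it in one pass.
        source, fns = self.plan()
        fused = fuse(fns)
        return [fused(x) for x in source]


def fuse(fns: List[Callable[[Any], Any]]) -> Callable[[Any], Any]:
    """Combine a chain of element-wise functions into one function."""
    def fused(x: Any) -> Any:
        for f in fns:
            x = f(x)
        return x
    return fused


calls: List[str] = []  # log of user-function invocations, for demonstration


def double(x: int) -> int:
    calls.append("double")
    return x * 2


def increment(x: int) -> int:
    calls.append("increment")
    return x + 1


pipeline = DeferredCollection(source=[1, 2, 3]).parallel_do(double).parallel_do(increment)
calls_before_run = list(calls)  # still empty: building the pipeline ran nothing
result = pipeline.run()         # the plan is fused and executed only now
```

In this sketch, building the pipeline leaves `calls` empty, and `run()` applies `double` and `increment` to each element in one traversal rather than materializing an intermediate list. A real system would instead translate the optimized plan onto distributed primitives such as MapReduce jobs, as the abstract describes.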

Paper Mentions

Niijima: sound and automated computation consolidation for efficient multilingual data-parallel pipelines
TLDR
Niijima, an optimizing compiler for Microsoft's Scope/Cosmos, is presented; it can consolidate C#-based user-defined operators (UDOs) across SQL statements, thereby reducing the number of dataflow vertices that require the managed runtime, and thus the amount of C# computation and the data marshalling cost.
Blaze: Simplified High Performance Cluster Computing
TLDR
Blaze, a C++ library that makes it easy to develop high-performance parallel programs for compute-intensive tasks, is presented; it has three main improvements over conventional MapReduce implementations: eager reduction, fast serialization, and special treatment for a small fixed key range.
Safe Data Parallelism for General Streaming
TLDR
This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing, and shows linear scalability for parallel regions that are computation-bound and near-linear scalability when tuples are shuffled across parallel regions.
Reoptimizing Data Parallel Computing
TLDR
RoPE collects certain code and data properties by piggybacking on job execution and adapts execution plans by feeding these properties to a query optimizer; the paper shows how this improves future invocations of the same jobs and characterizes the scenarios of benefit.
Composable and efficient functional big data processing framework
TLDR
The Hierarchically Distributed Data Matrix is presented, which is a functional, strongly-typed data representation for writing composable big data applications, and a runtime framework is provided to support the execution of HDM applications on distributed infrastructures.
Representations and Optimizations for Embedded Parallel Dataflow Languages
TLDR
It is argued that the limitations listed above are a side effect of the adopted type-based embedding approach and an alternative EDSL design based on quotations is proposed, which reconciles the benefits of embedded parallel dataflow DSLs with the declarativity and optimization potential of external DSLs like SQL.
iHadoop: Asynchronous Iterations for MapReduce
TLDR
Experiments using different data analysis applications over real-world and synthetic datasets show that iHadoop performs better than Hadoop for iterative algorithms, reducing execution time of iterative applications by 25% on average.
Auto-parallelizing stateful distributed streaming applications
TLDR
This paper presents a compiler and runtime system that automatically extracts data parallelism for distributed stream processing, guaranteeing safety, even in the presence of stateful, selective, and user-defined operators.
HDM: A Composable Framework for Big Data Processing
TLDR
The Hierarchically Distributed Data Matrix is presented, which is a functional, strongly-typed data representation for writing composable big data applications, and a runtime framework is provided to support the execution, integration and management of HDM applications on distributed infrastructures.
Composable Incremental and Iterative Data-Parallel Computation with Naiad
TLDR
This paper evaluates a prototype of Naiad, a set of declarative data-parallel language extensions and an associated runtime supporting efficient and composable incremental and iterative computation, that uses shared memory on a single multi-core computer.

References

SHOWING 1-10 OF 18 REFERENCES
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
TLDR
It is shown that excellent absolute performance can be attained (a general-purpose sort of 10^12 bytes of data executes in 319 seconds on a 240-computer, 960-disk cluster), and near-linear scaling of execution time is demonstrated on representative applications as the authors vary the number of computers used for a job.
MapReduce: Simplified Data Processing on Large Clusters
TLDR
This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Keynote talk: Experiences with MapReduce, an abstraction for large-scale computation
  • J. Dean
  • Computer Science
  • 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT)
  • 2006
TLDR
The basic programming model of MapReduce is described, the experience of using it in a variety of domains is discussed, and the implications of programming models like MapReduce as one paradigm to simplify development of parallel software for multi-core microprocessors are considered.
SCOPE: easy and efficient parallel processing of massive data sets
TLDR
A new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), is presented. It is targeted at massive data analysis and designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters.
Dryad: distributed data-parallel programs from sequential building blocks
TLDR
The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Map-reduce-merge: simplified relational data processing on large clusters
TLDR
A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by map and reduce modules, and it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.
Pig latin: a not-so-foreign language for data processing
TLDR
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. Its implementation, Pig, is an open-source Apache-incubator project available for general use.
Interpreting the data: Parallel analysis with Sawzall
TLDR
The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
LINQ: reconciling object, relations and XML in the .NET framework
TLDR
The .NET Language-Integrated Query (LINQ) framework, proposed for the next release of the .NET framework, approaches the problem of handling data from different data models by defining a pattern of general-purpose standard query operators for traversal, filter, and projection.
New Ideas in Parallel Lisp: Language Design, Implementation, and Programming Tools
  • R. Halstead
  • Computer Science
  • Workshop on Parallel Lisp
  • 1989
TLDR
Using new, elegant ideas in the areas of speculative computation, continuations, exception handling, aggregate data structures, and scheduling, it should be possible to build “second generation” parallel Lisp systems that are as powerful and elegantly structured as sequential Lisp systems.