Evaluating MapReduce for Multi-core and Multiprocessor Systems

@inproceedings{Ranger2007EvaluatingMF,
  title={Evaluating MapReduce for Multi-core and Multiprocessor Systems},
  author={Colby Ranger and Ramanan Raghuraman and Arun Penmetsa and Gary R. Bradski and Christoforos E. Kozyrakis},
  booktitle={2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA)},
  year={2007},
  pages={13--24}
}
This paper evaluates the suitability of the MapReduce model for multi-core and multiprocessor systems. MapReduce was created by Google for application development on data centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime…
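The programming style the abstract describes can be illustrated with a minimal sketch: the programmer supplies only a map function and a reduce function, and the runtime handles partitioning, grouping, and scheduling. This is an illustrative sequential sketch of the model, not the Phoenix C API (whose actual function names and signatures differ); a real runtime such as Phoenix would execute the map and reduce calls in parallel across cores.

```python
# Illustrative sketch of the MapReduce programming model (word count).
# The user writes map_fn and reduce_fn; map_reduce below stands in for
# the runtime, which in Phoenix would parallelize both phases.
from collections import defaultdict

def map_fn(doc):
    # Map: emit an intermediate (word, 1) pair for each word.
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Reduce: combine all values emitted for one key.
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: collect intermediate pairs, grouped by key.
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = map_reduce(["the map phase", "the reduce phase"], map_fn, reduce_fn)
# counts == {"the": 2, "map": 1, "phase": 2, "reduce": 1}
```

Because map and reduce are pure functions over independent inputs and keys, the runtime is free to schedule them across worker threads (shared memory) or machines (clusters) without programmer-visible synchronization.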

Decoupled MapReduce for Shared-Memory Multi-Core Architectures

This paper enhances the traditional MapReduce architecture by decoupling the map and combine phases in order to boost parallel execution, and demonstrates that the proposed solution achieves execution speedups of up to 2.46x compared to a state-of-the-art shared-memory MapReduce library.

Resource-Aware MapReduce Runtime for Multi/Many-core Architectures

A novel resource-aware MapReduce architecture is proposed that decouples the map and combine phases to increase the degree of parallelism, while effectively overlapping the memory-intensive combine with the compute-intensive map operation, resulting in superior resource utilization and performance.

Optimizing MapReduce for Multicore Architectures

A new MapReduce library is introduced, Metis, with a compromise data structure designed to perform well for most workloads, and experiments with the Phoenix benchmarks show that Metis’ data structure performs better than simpler alternatives, including Phoenix.

Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system

This work optimizes Phoenix, a MapReduce runtime for shared-memory multi-cores and multiprocessors, on a quad-chip, 32-core, 256-thread UltraSPARC T2+ system with NUMA characteristics and shows how a multi-layered approach leads to significant speedup improvements with 256 threads.

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures

An improvement to hard-disk access in MapReduce frameworks for multi-core architectures is presented, ensuring that distinct threads do not compete to take the processed keys from memory.

Improved Programming-Language Independent MapReduce on Shared-Memory Systems

This paper presents XRT, a high-performance and programming-language independent MapReduce runtime for shared-memory systems taking advantage of disk-based data structures for processing datasets which cannot fit in memory.

A MapReduce Skeleton for Skandium

This project implemented the MapReduce model for Skandium, a Java-based algorithmic skeleton library that targets multi-core architectures, and identified the main factors that affect the skeleton’s performance on shared-memory architectures when it is implemented on top of the Java platform.

Garbage collection auto-tuning for Java MapReduce on multi-cores

This paper presents MRJ, a MapReduce Java framework for multi-core architectures, and proposes the use of memory management auto-tuning techniques based on machine learning to achieve MRJ performance within 10% of optimal on 75% of the authors' benchmark tests.
...

References

SHOWING 1-10 OF 33 REFERENCES

MapReduce: simplified data processing on large clusters

This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

The common case transactional behavior of multithreaded programs

This work analyzes the common case transactional behavior for 35 multithreaded programs from a wide range of application domains to identify transactions within the source code by mapping existing primitives for parallelism and synchronization management to transaction boundaries.

The Jrpm system for dynamically parallelizing Java programs

Experimental results demonstrate that Jrpm can exploit thread-level parallelism with minimal effort from the programmer, and performance was achieved by automatic selection of thread decompositions by the hardware profiler, intra-procedural optimizations on code compiled dynamically into speculative threads, and some minor programmer transformations for exposing parallelism that cannot be performed automatically.

The implementation of the Cilk-5 multithreaded language

Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.

X10: an object-oriented approach to non-uniform cluster computing

A modern object-oriented programming language, X10, is designed for high performance, high productivity programming of NUCC systems and an overview of the X10 programming model and language, experience with the reference implementation, and results from some initial productivity comparisons between the X10 and Java™ languages are presented.

Language support for lightweight transactions

This work argues that these problems can be addressed by moving to a declarative style of concurrency control in which programmers directly indicate the safety properties that they require, which is easier for mainstream programmers to use, prevents lock-based priority-inversion and deadlock problems and can offer performance advantages.

A stream compiler for communication-exposed architectures

This paper describes a fully functional compiler that parallelizes StreamIt applications for Raw, including several load-balancing transformations, and demonstrates that the StreamIt compiler can automatically map a high-level stream abstraction to Raw without losing performance.

ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors that enables recovery from a wide class of errors, including the permanent loss of an entire node.

Detailed design and evaluation of redundant multithreading alternatives

It is found that RMT can be a more significant burden for single-processor devices than prior studies indicate, and a novel application of RMT techniques in a dual-processor device, termed chip-level redundant threading (CRT), shows higher performance than lockstepping the two cores, especially on multithreaded workloads.

Niagara: a 32-way multithreaded Sparc processor

The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.