Mars: A MapReduce Framework on graphics processors

  • Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, Tuyong Wang
  • Published 25 October 2008
  • Computer Science
  • 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)
We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures are designed as a special-purpose co-processor and their programming interfaces are typically… 
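
As a rough illustration of the MapReduce programming model the abstract refers to, the following sequential Python sketch runs a map step, groups the emitted values by key, and then runs a reduce step per key. It is not the Mars API and does not reflect how Mars schedules work on a GPU; the names `map_reduce`, `word_count_map`, and `word_count_reduce` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Sequential MapReduce sketch: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # map_fn emits zero or more (key, value) pairs per input record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce_fn folds all values that share a key into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    """Hypothetical map function: emit (word, 1) for each word."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    """Hypothetical reduce function: sum the per-word counts."""
    return sum(counts)


counts = map_reduce(
    ["the GPU and the CPU", "the GPU"],
    word_count_map,
    word_count_reduce,
)
print(counts)  # {'the': 3, 'gpu': 2, 'and': 1, 'cpu': 1}
```

In this model the application developer supplies only the two small functions, while the framework owns grouping and scheduling; that separation is what lets a runtime move the same program to different hardware.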

Mars: Accelerating MapReduce with Graphics Processors

The experimental results show that the GPU-CPU coprocessing of Mars on an NVIDIA GTX280 GPU and an Intel quad-core CPU outperformed Phoenix, the state-of-the-art MapReduce framework on multicore CPUs, with a speedup of up to 72 times and 24 times on average, depending on the application.

StreamMR: An Optimized MapReduce Framework for AMD GPUs

This paper proposes StreamMR, an OpenCL MapReduce framework optimized for AMD GPUs, with efficient atomic-free algorithms for output handling and intermediate result shuffling. StreamMR is superior to atomic-based MapReduce designs and can outperform existing atomic-free MapReduce implementations by nearly five-fold on an AMD Radeon HD 5870.

HeteroDoop: A MapReduce Programming System for Accelerator Clusters

Evaluation results of HeteroDoop on recent hardware indicate that usage of even a single GPU per node can improve performance by up to 2.6x, compared to a CPU-only Hadoop, running on a cluster with 20-core CPUs.

Parallel Computing Framework Based on MapReduce and GPU Clusters

Experiments show that the proposed parallel computing framework, based on a GPU cluster and MapReduce, completes the work and achieves a significant speedup for large-scale applications.

Accelerate MapReduce on GPUs with multi-level reduction

Jupiter supports two improvements: a multi-level reduction scheme tailored for the GPU memory hierarchy and a frequency-based cache policy on key-value pairs in shared memory. Experiments show that Jupiter can achieve up to 3x speedup over the original reduction-based GPU MapReduce framework on applications with many distinct keys.

Benchmark Hadoop and Mars : MapReduce on cluster versus on GPU

This paper comparatively evaluates the performance of the MapReduce model on Hadoop and on Mars, and concludes that Mars is up to two orders of magnitude faster than Hadoop, whereas Hadoop is more flexible with respect to dataset size and shape.

Accelerating MapReduce on a coupled CPU-GPU architecture

This work focuses on the challenge of scaling a MapReduce application using the CPU and GPU together in an integrated architecture, and develops a runtime tuning method that achieves very low load imbalance while keeping scheduling overheads low.

A MapReduce Computing Framework Based on GPU Cluster

  • Heng Gao, Jie Tang, Gangshan Wu
  • Computer Science
    2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing
  • 2013
A new parallel GPU programming framework based on MapReduce is designed and implemented to improve the efficiency, transparency, and scalability of high-performance computing on GPU clusters.

Using Shared Memory to Accelerate MapReduce on Graphics Processing Units

  • Feng Ji, Xiaosong Ma
  • Computer Science
    2011 IEEE International Parallel & Distributed Processing Symposium
  • 2011
This work designs and implements a GPU MapReduce framework whose key techniques include shared memory staging area management, thread-role partitioning, and intra-block thread synchronization, and proposes a novel GPU data staging scheme for MapReduce workloads, tailored toward the GPU memory hierarchy.

Using graphics processors for high-performance IR query processing

This work presents a basic system architecture for GPU-based high-performance IR and describes how to perform highly efficient query processing within such an architecture. Preliminary experimental results suggest that significant gains in query processing performance might be obtainable.



Relational joins on graphics processors

This work designs a set of data-parallel primitives such as split and sort, and uses these primitives to implement indexed or non-indexed nested-loop, sort-merge and hash joins, and utilizes the high parallelism as well as the high memory bandwidth of the GPU.

GPUTeraSort: high performance graphics co-processor sorting for large database management

Overall, the results indicate that using a GPU as a co-processor can significantly improve the performance of sorting algorithms on large databases.

Fast computation of database operations using graphics processors

This paper presents new algorithms for fast computation of several common database operations on commodity graphics processors. The algorithms take into account some of the limitations of the programming model of current GPUs and perform no data rearrangements.

Brook for GPUs: stream computing on graphics hardware

This paper presents Brook for GPUs, a system for general-purpose computation on programmable graphics hardware that abstracts and virtualizes many aspects of graphics hardware, and presents an analysis of the effectiveness of the GPU as a compute engine compared to the CPU.

A map reduce framework for programming graphics processors

The framework is built around the Map Reduce abstraction, which allows application developers to focus on their application, while enabling high performance GPU implementation, and shows the utility of the framework by implementing Support Vector Machine training as well as classification.

Evaluating MapReduce for Multi-core and Multiprocessor Systems

It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.

Accelerator: using data parallelism to program GPUs for general-purpose uses

This work describes Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses instead of C, and compares the performance of Accelerator versions of the benchmarks against hand-written pixel shaders.

MapReduce: simplified data processing on large clusters

This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Efficient gather and scatter operations on graphics processors

This paper designs multi-pass gather and scatter operations to improve their data access locality, and develops a performance model to help understand and optimize these two operations.
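
For readers unfamiliar with the two primitives, the sketch below shows gather (indexed reads) and scatter (indexed writes) in plain Python, plus one possible reading of a multi-pass scatter in which each pass writes only to one contiguous chunk of the output. The chunking scheme, the function names, and the `num_passes` parameter are assumptions made for illustration; this is not the paper's implementation or its performance model, and it assumes `indices` is a permutation so that no two writes collide.

```python
def gather(src, indices):
    """Gather: out[i] = src[indices[i]] (indexed reads)."""
    return [src[i] for i in indices]


def scatter(src, indices, out_size):
    """Scatter: out[indices[i]] = src[i] (indexed writes)."""
    out = [None] * out_size
    for value, i in zip(src, indices):
        out[i] = value
    return out


def multipass_scatter(src, indices, out_size, num_passes):
    """Scatter in several passes, each restricted to one output chunk.

    Assumption for this sketch: limiting each pass to a contiguous
    output range keeps its writes close together in memory.
    """
    out = [None] * out_size
    chunk = -(-out_size // num_passes)  # ceiling division
    for p in range(num_passes):
        lo, hi = p * chunk, min((p + 1) * chunk, out_size)
        for value, i in zip(src, indices):
            if lo <= i < hi:  # only write targets inside this pass's chunk
                out[i] = value
    return out


src = [10, 20, 30, 40]
idx = [2, 0, 3, 1]
print(gather(src, idx))                   # [30, 10, 40, 20]
print(scatter(src, idx, 4))               # [20, 40, 10, 30]
print(multipass_scatter(src, idx, 4, 2))  # [20, 40, 10, 30]
```

The multi-pass variant rescans the input once per pass, so it trades extra reads for more localized writes; whether that trade pays off depends on the hardware, which is the kind of question a performance model is meant to answer.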

GPU-ABiSort: optimal parallel sorting on stream architectures

  • A. Greß, G. Zachmann
  • Computer Science
    Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
  • 2006
This paper presents a novel approach for parallel sorting on stream processing architectures based on adaptive bitonic sorting that achieves the optimal time complexity O((n log n)/p) and presents an implementation on modern programmable graphics hardware (GPUs).