MapReduce: simplified data processing on large clusters

  • Muthu Dayalan
  • Published 6 December 2004
  • Computer Science
  • Commun. ACM
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the… 
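The map/reduce split described in the abstract can be illustrated with a minimal single-process sketch of word counting. This is not the paper's implementation: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names chosen for illustration, and the in-memory grouping step only stands in for the distributed shuffle that a real runtime performs across machines.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce: merge all values that share one intermediate key.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Hypothetical driver: group intermediate values by key in memory,
    # then apply the reduce function once per key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog and the fox"),
]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Because each call to `map_fn` depends only on its own input record, and each call to `reduce_fn` only on the values for one key, a runtime is free to spread both phases over many machines.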

Use of MapReduce in Distributed Systems

MapReduce is a programming model and software framework, with an associated implementation, for processing and generating large data sets across a broad variety of real-world tasks.

Distributed Programming with MapReduce

MapReduce was developed as a way of simplifying the development of large-scale computations at Google and allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Simplifying MapReduce Data Processing

This paper develops a web-based graphic user interface for ordinary users to utilize MapReduce without the real programming, where users only have to know how to specify their tasks in target-value-action tuples.

I/O Efficient Implementation of MapReduce CIS 5930

  • Computer Science
  • 2008
In this project, the problem of how to implement the MapReduce interface efficiently on a single machine is studied, and techniques that you have learned from class will be useful (and required!).

MapReduce: Distributed Computing for Machine Learning

It is concluded that MapReduce is a good choice for basic operations on large datasets, although there are complications to be addressed for more complex machine learning tasks.

General-Purpose Big Data Processing Systems

  • S. Sakr
  • Computer Science
    Big Data 2.0 Processing Systems
  • 2020
In 2004, Google introduced the MapReduce framework as a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on


An overview of MapReduce programming model, its various applications and different implementations is provided, and comparisons of Hadoop and GridGain are discussed.

Minimal MapReduce algorithms

The notion of minimal algorithm is presented, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor.

MapReduce Programming Model for .NET-based Distributed Computing

This technical report presents a realization of MapReduce for .NET-based data centers, including the programming model and runtime system, and its performance evaluation.

Simplified Data Processing for Large Cluster: A MapReduce and Hadoop Based Study

The study discusses how the MapReduce and Hadoop frameworks handle terabytes and petabytes of storage across thousands of machines working in parallel, so that large-scale processing and manipulation of big data can be carried out effectively.



Evaluating MapReduce for Multi-core and Multiprocessor Systems

It is established that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.

The Google file system

This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.

Web Search for a Planet: The Google Cluster Architecture

Google's architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software that achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.

Active Disks for Large-Scale Data Processing

This work proposes using an active disk storage device that combines on-drive processing and memory with software downloadability to allow disks to execute application-level functions directly at the device.

High-performance sorting on networks of workstations

We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale

Map-Reduce for Machine Learning on Multicore

This work shows that algorithms that fit the Statistical Query model can be written in a certain "summation form," which allows them to be easily parallelized on multicore computers and shows basically linear speedup with an increasing number of processors.
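As a rough illustration of the "summation form" idea, the sketch below fits a one-variable least-squares line by having each data shard compute partial sums (the map step) and then adding those sums together (the reduce step). The example problem, the function names `partial_sums`, `combine`, and `fit_line`, and the data are assumptions made for illustration; they are not taken from the cited work.

```python
from functools import reduce

def partial_sums(shard):
    # Map step: each shard independently computes its sufficient statistics.
    n = len(shard)
    sx = sum(x for x, _ in shard)
    sy = sum(y for _, y in shard)
    sxx = sum(x * x for x, _ in shard)
    sxy = sum(x * y for x, y in shard)
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    # Reduce step: partial sums from two shards add element-wise.
    return tuple(p + q for p, q in zip(a, b))

def fit_line(shards):
    # Closed-form simple linear regression from the combined sums.
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_sums, shards))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data lying on y = 2x + 1, split across two shards.
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit_line(shards)
```

Since addition is associative and commutative, the shards can be processed in any order and on any number of cores without changing the fitted line, up to floating-point rounding.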

Charlotte: Metacomputing on the Web

Explicit Control in the Batch-Aware Distributed File System

We present the design, implementation, and evaluation of the Batch-Aware Distributed File System (BAD-FS), a system designed to orchestrate large, I/O-intensive batch workloads on remote computing

SPsort: How to Sort a Terabyte Quickly

In December 1998, a 488 node IBM RS/6000 SP sorted a terabyte of data (10 billion 100 byte records) in 17 minutes, 37 seconds. This is more than 2.5 times faster than the previous record for a

Cluster I/O with River: making the fast case common

This work introduces River, a data-flow programming environment and I/O substrate for clusters of computers based on two simple design features: a high-performance distributed queue, and a storage redundancy mechanism called graduated declustering.