MapReduce: Simplified Data Processing on Large Clusters

@inproceedings{Dean2004MapReduceSD,
  title={MapReduce: Simplified Data Processing on Large Clusters},
  author={Jeffrey Dean and Sanjay Ghemawat},
  booktitle={OSDI},
  year={2004}
}
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a… 
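The map and reduce roles described in the abstract can be illustrated with a minimal single-process sketch in Python. It is an illustration only, not the paper's implementation: the names `run_mapreduce`, `map_fn`, and `reduce_fn`, the word-count task, and the sample documents are all assumptions chosen for the example, and nothing here is parallel or distributed.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_fn(key: str, value: str) -> Iterator[tuple[str, int]]:
    """Map: take one (document name, text) pair, emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: list[int]) -> int:
    """Reduce: merge all intermediate values that share one key."""
    return sum(values)


def run_mapreduce(
    inputs: Iterable[tuple[str, str]],
    mapper: Callable[[str, str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, list[int]], int],
) -> dict[str, int]:
    """Hypothetical sequential driver: map, group by key, then reduce."""
    intermediate: dict[str, list[int]] = defaultdict(list)
    # Map phase: every input pair produces zero or more intermediate pairs.
    for in_key, in_value in inputs:
        for out_key, out_value in mapper(in_key, in_value):
            # Grouping step: collect values under their intermediate key.
            intermediate[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct intermediate key.
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}


# Illustrative input data (made up for this sketch).
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(documents, map_fn, reduce_fn)
print(counts)
```

Because each `map_fn` call depends only on its own input pair and each `reduce_fn` call only on the values for one key, a runtime is free to run those calls on different machines; the sequential driver above stands in for that runtime.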
MapReduce: simplified data processing on large clusters
TLDR
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
A Survey on Big Data Analytics and MapReduce Operations on Distributed Systems
MapReduce is a programming model used for generating and processing large datasets and terabytes of data across multiple clusters. The model has two functions, ‘Map’ and ‘Reduce’. ‘Map’
Sorting Process In Mapreduce Task
TLDR
An alternative to the prevailing load-sort-store solution is proposed; it can generate a small number of longer runs, leading to a faster merge phase and better sort performance within the MapReduce framework.
Parallel Processing of cluster by Map Reduce
TLDR
The author addresses how data is represented across nodes so that each node carries a balanced data-processing load, stored in parallel, to improve data-processing performance.
I/O Efficient Implementation of MapReduce CIS 5930
  • 2008
MapReduce is a programming model and an associated implementation used by Google for processing their massive data sets. It has a simple yet powerful interface that is amenable to a broad variety of
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
The computer industry is being challenged to develop methods and techniques for affordable data processing on large datasets at optimum response times. The technical challenges in dealing with the
The family of mapreduce and large-scale data processing systems
TLDR
This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities.
Survey of Parallel Data Processing in Context with MapReduce
TLDR
The author addresses how data is represented across nodes so that each node carries a balanced data-processing load, stored in parallel, to improve data-processing performance.
Map Reduce a Programming Model for Cloud Computing Based On Hadoop Ecosystem
Cloud Computing is emerging as a new computational paradigm shift. Hadoop MapReduce has become a powerful Computation Model for processing large data on distributed commodity hardware clusters such as
THE SURVEY ON MAPREDUCE
TLDR
An overview of the MapReduce programming model, its various applications, and its different implementations is provided, and Hadoop and GridGain are compared.

References

The Google file system
TLDR
This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
High-performance sorting on networks of workstations
We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale
Active Disks for Large-Scale Data Processing
TLDR
This work proposes using an active disk storage device that combines on-drive processing and memory with software downloadability to allow disks to execute application-level functions directly at the device.
Explicit Control in the Batch-Aware Distributed File System
We present the design, implementation, and evaluation of the Batch-Aware Distributed File System (BAD-FS), a system designed to orchestrate large, I/O-intensive batch workloads on remote computing
Map-Reduce for Machine Learning on Multicore
TLDR
This work shows that algorithms that fit the Statistical Query model can be written in a certain "summation form," which allows them to be easily parallelized on multicore computers and shows basically linear speedup with an increasing number of processors.
Charlotte: Metacomputing on the Web
TLDR
A system which enables application programmers to write parallel programs in Java and allows Java-capable browsers to execute parallel tasks is presented, which comprises a virtual machine model which isolates the program from the execution environment, and a runtime system realizing this virtual machine on the Web.
Scans as Primitive Parallel Operations
A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic
SPsort: How to Sort a Terabyte Quickly
In December 1998, a 488 node IBM RS/6000 SP sorted a terabyte of data (10 billion 100 byte records) in 17 minutes, 37 seconds. This is more than 2.5 times faster than the previous record for a
Cluster I/O with River: making the fast case common
TLDR
This work introduces River, a data-flow programming environment and I/O substrate for clusters of computers based on two simple design features: a high-performance distributed queue, and a storage redundancy mechanism called graduated declustering.
Diamond: A Storage Architecture for Early Discard in Interactive Search
TLDR
An informal user study of an image retrieval application supports the belief that early discard significantly improves the quality of interactive searches.