MapReduce: a flexible data processing tool

@article{dean2010mapreduce,
  title={MapReduce: a flexible data processing tool},
  author={Jeffrey Dean and Sanjay Ghemawat},
  journal={Commun. ACM},
  year={2010}
}
MapReduce's advantages over parallel databases include storage-system independence and fine-grained fault tolerance for large jobs.
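The programming model behind these claims is compact: users supply only a map function and a reduce function, and the runtime handles partitioning, shuffling, and fault recovery. A minimal word-count sketch in plain Python (the `map_fn`, `reduce_fn`, and `run_mapreduce` names are illustrative, not the paper's actual API):

```python
from collections import defaultdict

def map_fn(document):
    """Map phase: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce phase: sum all partial counts for a single word."""
    return word, sum(counts)

def run_mapreduce(documents):
    """Sequential stand-in for the distributed runtime."""
    # Shuffle phase: group intermediate pairs by key.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In the real system each map and reduce invocation runs on a different machine, which is what makes the fine-grain fault tolerance possible: a failed task is simply re-executed elsewhere.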
Generalized Parallel Join Algorithms and Designing Cost Models
The aim of this work is to compare different join algorithms and to design cost models for later use in the query optimizer.
Comparative study parallel join algorithms for MapReduce environment
The aim of this work is to generalize and compare existing equi-join algorithms, together with several optimization techniques, with a focus on the MapReduce environment.
Applying MapReduce Programming Model for Handling Scientific Problems
  • Yun-hee Kang, Y. B. Park
  • Computer Science
    2014 International Conference on Information Science & Applications (ICISA)
  • 2014
A Hadoop application with map and reduce functions for data transformation is presented, demonstrating the need to efficiently provision resources for diverse MapReduce applications.
Evaluating data storage structures of MapReduce
The experimental results show that the RCFile data storage structure achieves better performance in most cases than row-store and column-store structures in MapReduce.
Combining Stream Processing Engines and Big Data Storages for Data Analysis
We propose a system combining stream processing engines and big data storages for analyzing large amounts of data streams. It allows us to analyze data online and to store data for later offline analysis.
Data-Intensive Text Processing with MapReduce
This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation.
Astronomical Data Application Research Based on MapReduce
  • Qingfa Cui, Sheng-Chuan Wu
  • Computer Science
    2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC)
  • 2018
In constructing the experimental platform, this paper designs and implements a cone-search service based on MapReduce, and shows that the MapReduce-based approach to astronomical data applications greatly improves processing capacity.
Cogset: a high performance MapReduce engine
Cogset's architecture is presented and its performance as a MapReduce engine is evaluated against Hadoop, showing that Cogset generally outperforms Hadoop by a significant margin.
Towards improved load balancing for data intensive distributed computing
Techniques for improving load balancing -- particularly for multi-stage jobs and via dynamic partition assignment -- are introduced, using a modified programming model that offers greater flexibility while maintaining the simplicity, scalability, and fault tolerance of MapReduce.


Bigtable: A Distributed Storage System for Structured Data
The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.
HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
This paper explores the feasibility of building a hybrid system that takes the best features from both technologies; the prototype built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.
A comparison of approaches to large-scale data analysis
A benchmark consisting of a collection of tasks that are run on an open source version of MR as well as on two parallel DBMSs shows a dramatic performance difference between the two paradigms.
Pig latin: a not-so-foreign language for data processing
A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
Interpreting the data: Parallel analysis with Sawzall
The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
MapReduce: Simplified Data Processing on Large Clusters
  • Proceedings of the Sixth Symposium on Operating Systems Design and Implementation (OSDI), USENIX Association
  • 2004
Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System
  • Proceedings of the 19th ACM Symposium on Operating Systems Principles (Lake George, NY, Oct. 19–22). ACM Press, New York
  • 2003
Greenplum MapReduce: Bringing Next-Generation Analytics Technology to the Enterprise
Hadoop: Documentation and open-source release