Designing good MapReduce algorithms

@article{Ullman2012DesigningGM,
  title={Designing good MapReduce algorithms},
  author={Jeffrey D. Ullman},
  journal={XRDS},
  year={2012},
  volume={19},
  pages={30-34}
}
  • J. Ullman
  • Published 1 September 2012
  • Computer Science
  • XRDS
An introduction to designing algorithms for the MapReduce framework for parallel processing of big data. 

Optimizing a MapReduce module of preprocessing high-throughput DNA sequencing data

TLDR
This study focuses on performance optimization of a MapReduce application, CloudRS, which tackles the problem of detecting and removing errors in next-generation sequencing de novo genomic data.

MapReduce Algorithm for Single Source Shortest Path Problem

TLDR
This paper proposes MR-DSMR, a MapReduce version of the Dijkstra Strip-mined Relaxation (DSMR) algorithm, and the MR3-BFS algorithm, and compares the performance of both algorithms with BFS.

A Survey on Geographically Distributed Big-Data Processing Using MapReduce

TLDR
Batch processing, stream processing, MapReduce-based systems, and SQL-style processing geo-distributed frameworks, models, and algorithms with their overhead issues are classified and studied.

RuleMR: Classification rule discovery with MapReduce

TLDR
Experimental evaluations indicate that the proposed algorithm, RuleMR, not only scales well with respect to the size of the training dataset but also, in many cases, produces a model comparable in accuracy to many well-known algorithms.

Massive-scale processing of record-oriented and graph data

TLDR
A theoretical framework for the MapReduce system is presented to analyze the cost of distribution for different problem domains and to evaluate the "goodness" of different algorithms, and a fundamental tradeoff between the parallelism and communication costs of algorithms is identified.

Scheduling MapReduce Jobs and Data Shuffle on Unrelated Processors

TLDR
A constant approximation algorithm for generalizations of the Flexible Flow Shop (FFS) problem which form a realistic model for non-preemptive scheduling in MapReduce systems and improves substantially on the model proposed by Moseley et al.

A Study of Hadoop: Structure and Performance Issues

TLDR
The structure of Hadoop is studied, along with how its different components contribute to its performance and some performance issues affecting Hadoop.

Logical Aspects of Massively Parallel and Distributed Systems

TLDR
The first part of the paper concerns massively parallel systems where computation proceeds in a number of synchronized rounds and the focus is on evaluation algorithms for conjunctive queries as well as on reasoning about correctness and optimization of such algorithms.

Big Data Management Challenges, Approaches, Tools and their limitations

TLDR
This chapter examines the main challenges involved in the three V's of Big Data, and provides a classification of different functions offered by NewSQL systems and discusses their benefits and limitations for processing Big Data.

References

MapReduce: simplified data processing on large clusters

TLDR
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

Hadoop: The Definitive Guide

TLDR
This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Vision Paper: Towards an Understanding of the Limits of Map-Reduce Computation

TLDR
This is a vision paper that attempts to answer the questions described above about the ease of "map-reducability" - whether the problem can be partitioned into independent pieces, which are distributed across mappers/reducers.

Fuzzy Joins Using MapReduce

TLDR
It is found that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered.

Map-reduce extensions and recursive queries

TLDR
This work proposes several algorithmic ideas for efficient implementation of recursions in the map-reduce environment and discusses several alternatives for supporting recovery from failures without restarting the entire job.

Counting triangles and the curse of the last reducer

TLDR
This work describes a sequential triangle counting algorithm and shows how to adapt it to the MapReduce setting, and presents a new algorithm designed specifically for the MapReduce framework that achieves a factor of 10-100 speedup over the naive approach.

SkewTune: mitigating skew in mapreduce applications

TLDR
The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.

Mining of Massive Datasets

TLDR
Determining relevant data is key to delivering value from massive amounts of data, and big data is defined less by volume, which is a constantly moving target, than by its ever-increasing variety, velocity, variability and complexity.

Enumerating subgraph instances using map-reduce

TLDR
This paper exploits the techniques of [1] for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round for the simplest sample graph, the triangle, and addresses the matter of optimizing computation cost.

On the Number of Subgraphs of Prescribed Type of Graphs with a Given Number of Edges

All graphs considered are finite, undirected, with no loops, no multiple edges and no isolated vertices. For a graph $H=(V(H),E(H))$ and for $S \subseteq V(H)$ define $N(S) = \{x \in V(H) : xy \in E(H) \text{ for some } y \in S\}$.