Fault Tolerance in MapReduce: A Survey

Bunjamin Memishi, Shadi Ibrahim, María S. Pérez-Hernández, Gabriel Antoniu
In: Resource Management for Big Data Platforms
MapReduce-based systems have emerged as a prominent framework for large-scale data analysis, with fault tolerance as one of their key features. MapReduce introduced simple yet efficient mechanisms to handle different kinds of failures, including crashes, omissions, and arbitrary failures. This contribution discusses in detail the types of failures in MapReduce systems and surveys the different mechanisms used in the framework for detecting, handling, and recovering from these failures. It…
Analyzing fault tolerance mechanism of Hadoop Mapreduce under different type of failures
This work evaluates the performance of many representative Hadoop MapReduce applications under different execution parameters and different failure scenarios, leading to a better understanding of the fault tolerance mechanism of Hadoop MapReduce in the presence of failures.
MapReduce Data Skewness Handling: A Systematic Literature Review
This review concludes that important parameters have not been considered in MapReduce data skewness handling approaches.
Fault Tolerant Distributed Join Algorithm in RDBMS
A new fault-tolerant join algorithm for distributed RDBMS is proposed; the results obtained so far and a detailed plan of further research are discussed.
A Study on Fault Tolerance Mechanisms in Cloud Computing
A comparison of the main fault tolerance techniques is presented, considering cost, overhead, failure types, performance, and the tools used; models that enhance the performance of checkpoint-based and replication-based techniques are also studied.
Understanding the performance of erasure codes in hadoop distributed file system
This work measures and compares the performance of data accesses in HDFS under both replication and erasure coding and indicates that EC is a feasible solution for data-intensive applications and it can outperform replication in many scenarios.
Load-Balance and Fault-Tolerance for Massively Parallel Phylogenetic Inference
RAxML-ng, a widely used tool to build phylogenetic trees, is extended to mitigate hardware failures without user intervention; the checkpointing frequency is increased, and algorithms to solve the multi-sender h-relation problem and the unilaterally-saturating 1-matching problem are presented.
ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms
This work presents an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables recovery of lost data after process failure and shows a substantial speedup of the recovery time for the fault-tolerant version of a widely used bioinformatics application.
A Selective and Incremental Backup Scheme for Task Pools
This paper suggests an uncoordinated application-level checkpointing technique for task pools that selectively and incrementally saves only those tasks that have stayed in the pool during some period of time and that have not been saved before.


On the Feasibility of Byzantine Fault-Tolerant MapReduce in Clouds-of-Clouds
This work presents a MapReduce runtime that tolerates arbitrary faults and runs in a set of clouds at a reasonable cost in terms of computation and execution time.
Byzantine Fault-Tolerant MapReduce: Faults are Not Just Crashes
An experimental evaluation shows that the execution of a job with the algorithm and prototype presented uses twice the resources of the original Hadoop, instead of the 3 or 4 times more that would be achieved with the direct application of common Byzantine fault-tolerance paradigms.
Performance under Failures of MapReduce Applications
A stochastic performance model is built to quantify the impact of failures on MapReduce applications and to investigate its effectiveness under different computing environments. Results show that data replication is an effective approach even when the failure rate is high, and that the task migration mechanism of MapReduce works well in balancing the reliability difference among individual nodes.
On the Dynamic Shifting of the MapReduce Timeout
This book chapter investigates the problem of failure detection in the MapReduce framework and presents design ideas for a new adaptive timeout.
MapReduce Online
A modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed, and can reduce completion times and improve system utilization for batch jobs as well.
Cluster fault-tolerance: An experimental evaluation of checkpointing and MapReduce through simulation
T. Bressoud, M. Kozuch. 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
A discrete event simulation driven by the LANL data and by models of parallel checkpointing and MapReduce tasks is described, with the goal of minimizing the expected running time of a parallel program in a cluster in the presence of faults for both fault tolerance models.
Towards MapReduce for Desktop Grid Computing
This paper presents the architecture of a prototype of the MapReduce programming model based on BitDew, a middleware for large-scale data management on Desktop Grids, and describes the set of features that make this approach suitable for large-scale, loosely connected Internet Desktop Grids.
Achieving Accountable MapReduce in cloud computing
Apache Hadoop YARN: yet another resource negotiator
This paper summarizes the design, development, and current state of deployment of YARN, the next generation of Hadoop's compute platform, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
Spark: Cluster Computing with Working Sets
Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.