MapReduce Performance Models for Hadoop 2.x

@article{Glushkova2017MapReducePM,
  title={MapReduce Performance Models for Hadoop 2.x},
  author={Daria Glushkova and Petar Jovanovic and A. Abell{\'o}},
  journal={Inf. Syst.},
  year={2017},
  volume={79},
  pages={32-43}
}
MapReduce is a popular programming model for distributed processing of large data sets, and Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem; at the same time, it may provide reasonably accurate estimates of job response time at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining MapReduce performance models…
Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach
TLDR
A model based on MapReduce phases for predicting the execution time of jobs in a heterogeneous cluster is presented, and a novel heuristic method is designed, which significantly reduces the makespan of the jobs.
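
To make the phase-based idea concrete, here is a minimal sketch of an additive phase-level estimate: the job's expected runtime is the sum of per-phase times, each derived from benchmarked per-phase throughput. The phase names, rates, and selectivities below are illustrative assumptions, not the calibrated model from this paper.

    # Minimal sketch of a phase-level execution-time estimate for a MapReduce job.
    # Phase names, benchmark rates, and the linear cost form are illustrative
    # assumptions, not the model published in the cited paper.

    PHASES = ["read", "map", "shuffle", "sort", "reduce", "write"]

    def estimate_job_time(input_mb, rate_mb_per_s, selectivity):
        """Sum per-phase times: data volume entering each phase divided by the
        benchmarked throughput of that phase; selectivity models how map/reduce
        shrink (or grow) the data handed to the next phase."""
        total_s = 0.0
        data_mb = float(input_mb)
        for phase in PHASES:
            total_s += data_mb / rate_mb_per_s[phase]
            data_mb *= selectivity.get(phase, 1.0)
        return total_s

    # Hypothetical per-phase throughput (MB/s) measured by micro-benchmarks.
    rates = {"read": 120, "map": 90, "shuffle": 60, "sort": 80, "reduce": 70, "write": 100}
    selectivity = {"map": 0.4, "reduce": 0.2}   # output/input ratio of map and reduce
    print(round(estimate_job_time(10_240, rates, selectivity), 1), "s")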
An Accurate and Efficient Scheduler for Hadoop MapReduce Framework
  • D. Vinutha, G. Raju
  • Computer Science
  • Indonesian Journal of Electrical Engineering and Computer Science
  • 2018
TLDR
OHMR (Optimized Hadoop MapReduce) is proposed to process data in real time and utilize system resources efficiently, and shows significant performance improvement in terms of computation time.
Investigating the performance of Hadoop and Spark platforms on machine learning algorithms
TLDR
The K-nearest neighbor (KNN) algorithm is implemented on datasets of different sizes within both the Hadoop and Spark frameworks, and the results show that the runtime of the KNN algorithm implemented on Spark is 4 to 4.5 times faster than on Hadoop.
Historical data based approach for straggler avoidance in a heterogeneous Hadoop cluster
TLDR
A historical-data-based data placement (HDBDP) policy is proposed to balance the workload among heterogeneous nodes according to their computing capabilities, improving map task data locality and reducing job turnaround time in a heterogeneous Hadoop environment.
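
As a rough illustration of capability-aware placement (not the published HDBDP policy itself), the sketch below splits an input's blocks across nodes in proportion to hypothetical, historically observed per-node processing rates.

    # Illustrative capability-proportional block placement (hypothetical node
    # profiles, not the HDBDP policy as published): faster nodes get more blocks.

    def place_blocks(total_blocks, node_rates):
        """Split blocks across nodes in proportion to historical processing rate,
        using largest-remainder rounding so the counts sum to total_blocks."""
        total_rate = sum(node_rates.values())
        exact = {n: total_blocks * r / total_rate for n, r in node_rates.items()}
        plan = {n: int(x) for n, x in exact.items()}
        leftovers = sorted(exact, key=lambda n: exact[n] - plan[n], reverse=True)
        for n in leftovers[: total_blocks - sum(plan.values())]:
            plan[n] += 1
        return plan

    # Historical MB/s per node (hypothetical heterogeneous cluster).
    rates = {"node-a": 120.0, "node-b": 80.0, "node-c": 40.0}
    print(place_blocks(100, rates))  # {'node-a': 50, 'node-b': 33, 'node-c': 17}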
LSTPD: Least Slack Time-Based Preemptive Deadline Constraint Scheduler for Hadoop Clusters
TLDR
A novel preemptive approach is proposed that considers the remaining execution time of the currently running job when deciding on preemption, and it significantly reduces job execution time and queue waiting time compared to existing schemes.
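
The least-slack-time rule at the core of such a scheduler is easy to sketch: slack is the time remaining to the deadline minus the job's estimated remaining execution time, and the job with the least slack runs next. The job records and numbers below are hypothetical, not the LSTPD implementation.

    # Illustrative least-slack-time selection (hypothetical job records, not the
    # LSTPD implementation): slack = time to deadline minus remaining work.
    import time
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        deadline: float          # absolute timestamp (seconds)
        remaining_exec_s: float  # estimated remaining execution time

    def slack(job, now):
        return (job.deadline - now) - job.remaining_exec_s

    def pick_next(jobs, now=None):
        """Run the job with the least slack; negative slack means the deadline
        can no longer be met without preemption or extra resources."""
        now = time.time() if now is None else now
        return min(jobs, key=lambda j: slack(j, now))

    now = 1_000.0
    jobs = [Job("etl", now + 600, 200), Job("report", now + 300, 250)]
    print(pick_next(jobs, now).name)  # "report": slack 50 s vs. 400 s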
Optimized memory model for hadoop map reduce framework
TLDR
This work presents a novel memory optimization model for the Hadoop MapReduce framework, namely MOHMR (Optimized Hadoop MapReduce), to process data in real time and utilize system resources efficiently.
Benchmarking and Performance Modelling of MapReduce Communication Pattern
TLDR
This work studied the low-level internals of the MapReduce communication pattern and used a minimal set of performance drivers to develop a set of phase-level parametric models that can be used to infer the performance of unseen applications and to approximate their performance when an arbitrary dataset is used as input.
Optimising Cloud-Based Hadoop 2.x Applications
  • Naif Alasmari
  • Computer Science
  • 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion)
  • 2018
TLDR
The approach will provide those responsible for the configuration of a Hadoop application with a set of Pareto-optimal configurations, allowing them to run the application optimally within their time and/or budgetary constraints.
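
To illustrate what a set of Pareto-optimal configurations means here, the sketch below filters hypothetical (runtime, cost) configuration pairs down to the non-dominated ones; the configuration names and numbers are invented for the example.

    # Keep only non-dominated (runtime, cost) configurations: a configuration is
    # dominated if another is at least as good on both objectives and strictly
    # better on one. Names and numbers are illustrative, not measured results.

    def pareto_front(configs):
        front = []
        for name, runtime, cost in configs:
            dominated = any(
                r <= runtime and c <= cost and (r < runtime or c < cost)
                for _, r, c in configs
            )
            if not dominated:
                front.append((name, runtime, cost))
        return front

    configs = [
        ("2 small nodes", 3600, 1.2),
        ("4 small nodes", 2000, 2.4),
        ("4 large nodes", 1100, 4.8),
        ("8 small nodes", 2100, 4.8),   # dominated by "4 large nodes"
    ]
    print(pareto_front(configs))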
Stable Modeling on Resource Usage Parameters of MapReduce Application
TLDR
This paper models the relationship of resource usage parameters of MapReduce applications using multiple linear regression methods, investigates the minimum sampling time for stable modeling, and proposes an approach that can be used to build a stable performance model exposing the bottleneck resource of the Hadoop platform.
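
A minimal version of the regression step can be sketched with NumPy's least-squares solver; the chosen features (input size and number of map tasks), the samples, and the target are illustrative assumptions rather than the paper's measured parameters.

    # Fit a multiple linear regression: usage ≈ b0 + b1*input_GB + b2*map_tasks.
    # Features, samples, and target are illustrative, not the paper's measurements.
    import numpy as np

    samples = np.array([
        # input_GB, map_tasks
        [1.0,  8],
        [2.0, 16],
        [4.0, 30],
        [8.0, 66],
    ])
    cpu_seconds = np.array([55.0, 102.0, 210.0, 405.0])  # observed resource usage

    X = np.column_stack([np.ones(len(samples)), samples])    # add intercept column
    coef, *_ = np.linalg.lstsq(X, cpu_seconds, rcond=None)   # [b0, b1, b2]

    predict = lambda gb, maps: coef @ np.array([1.0, gb, maps])
    print(coef, predict(6.0, 48))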
Investigating Automatic Parameter Tuning for SQL-on-Hadoop Systems
TLDR
It is revealed that the performance of Hive queries does not necessarily improve when using Hadoop-focused tuning advisors out-of-the-box, at least when following the current approach of applying the same tuning setup uniformly for evaluating the entire query.

References

SHOWING 1-10 OF 32 REFERENCES
Analytical Performance Models for MapReduce Workloads
TLDR
This work proposes a hierarchical model that combines a precedence graph model and a queuing network model to capture the intra-job synchronization constraints of MapReduce, and produces estimates of average job response time that deviate from measurements of a real setup by less than 15%.
Apache Hadoop YARN: yet another resource negotiator
TLDR
This paper summarizes the design, development, and current state of deployment of the next generation of Hadoop's compute platform, YARN, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
Hadoop Performance Models
TLDR
A detailed set of mathematical performance models describing the execution of a MapReduce job on Hadoop is presented; these models can be used to estimate the performance of MapReduce jobs as well as to find the optimal configuration settings to use when running the jobs.
ARIA: automatic resource inference and allocation for mapreduce environments
TLDR
This work designs a MapReduce performance model and implements a novel SLO-based scheduler in Hadoop that determines job ordering and the amount of resources to allocate for meeting the job deadlines and validate the approach using a set of realistic applications.
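
An ARIA-style estimate is commonly described through lower and upper makespan bounds for a stage of n independent tasks on k slots; the sketch below uses bounds of that form to search for the smallest slot allocation whose conservative (upper-bound) estimate still meets a deadline. Treat the formulas as an approximation of the published model and the profile numbers as hypothetical.

    # Sketch of ARIA-style makespan bounds for one stage of n independent tasks
    # running on k slots under greedy assignment. Profile numbers are hypothetical.

    def stage_bounds(n_tasks, avg_s, max_s, slots):
        lower = n_tasks * avg_s / slots
        upper = (n_tasks - 1) * avg_s / slots + max_s
        return lower, upper

    def min_slots_for_deadline(n_tasks, avg_s, max_s, deadline_s, max_slots=1000):
        """Smallest slot count whose *upper* bound still meets the deadline,
        i.e. a conservative allocation in the spirit of SLO-based scheduling."""
        for k in range(1, max_slots + 1):
            if stage_bounds(n_tasks, avg_s, max_s, k)[1] <= deadline_s:
                return k
        return None  # deadline infeasible even with max_slots

    # Hypothetical map-stage profile: 200 tasks, average 30 s, slowest 55 s.
    print(min_slots_for_deadline(200, 30.0, 55.0, deadline_s=600))  # -> 11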
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
TLDR
This paper evaluates the major architectural components in the MapReduce and Spark frameworks, including shuffle, execution model, and caching, by using a set of important analytic workloads, and shows that MapReduce's execution model is more efficient for shuffling data than Spark's, thus making Sort run faster on MapReduce.
MapReduce: simplified data processing on large clusters
TLDR
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Multi-Resource Packing for Cluster Schedulers
Tasks in modern data-parallel clusters have highly diverse resource requirements along CPU, memory, disk, and network. We present Tetris, a multi-resource cluster scheduler that packs tasks to machines…
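
Tetris's packing heuristic is usually summarized as scoring each runnable task by the dot product of its resource-demand vector with the machine's free-resource vector and placing the best-aligned task that fits; the sketch below illustrates that scoring rule with made-up resource vectors.

    # Illustrative multi-resource packing score in the spirit of Tetris:
    # align task demands with a machine's free resources via a dot product.
    # Vectors are (cpu_cores, memory_GB, disk_MBps, net_MBps); values are made up.

    def alignment_score(task_demand, machine_free):
        return sum(d * f for d, f in zip(task_demand, machine_free))

    def fits(task_demand, machine_free):
        return all(d <= f for d, f in zip(task_demand, machine_free))

    def pick_task(runnable_tasks, machine_free):
        """Among tasks that fit, pick the one whose demand vector best aligns
        with the machine's currently free resources."""
        candidates = [t for t in runnable_tasks if fits(t["demand"], machine_free)]
        if not candidates:
            return None
        return max(candidates, key=lambda t: alignment_score(t["demand"], machine_free))

    machine_free = (8, 32, 400, 100)
    tasks = [
        {"id": "map-3",    "demand": (1, 2, 150, 10)},
        {"id": "reduce-1", "demand": (2, 8,  50, 80)},
        {"id": "map-9",    "demand": (4, 4, 300, 20)},
    ]
    print(pick_task(tasks, machine_free)["id"])  # "map-9" aligns best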
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2
This book is a critically needed resource for the newly released Apache Hadoop 2.0, highlighting YARN as the significant breakthrough that broadens Hadoop beyond the MapReduce paradigm. From the…
The Hadoop Distributed File System
TLDR
The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.
Analytic Queueing Network Models for Parallel Processing of Task Systems
TLDR
An efficient algorithm is presented to determine the mean completion time and related performance measures for a task system: a set of tasks with precedence relationships in their execution sequence such that the resulting graph is acyclic.
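
The quantity targeted here, the mean completion time of an acyclic task graph, can be illustrated (though not computed efficiently) by a small Monte Carlo sketch over a hypothetical DAG with exponential task times and no resource contention; the paper's contribution is an analytic algorithm that avoids such sampling.

    # Monte Carlo illustration of the mean completion time of an acyclic task system.
    # The task graph and exponential service times are hypothetical, contention is
    # ignored, and the cited paper computes this quantity analytically instead.
    import random

    # DAG as task -> list of prerequisite tasks; mean service times in seconds.
    PREREQS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    MEAN_S  = {"A": 10.0, "B": 20.0, "C": 15.0, "D": 5.0}

    def one_run():
        finish = {}
        for task in ("A", "B", "C", "D"):  # topological order
            start = max((finish[p] for p in PREREQS[task]), default=0.0)
            finish[task] = start + random.expovariate(1.0 / MEAN_S[task])
        return max(finish.values())

    random.seed(0)
    runs = [one_run() for _ in range(100_000)]
    print(round(sum(runs) / len(runs), 2), "s mean completion time")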