Corpus ID: 13882389

Hadoop Performance Tuning - A Pragmatic & Iterative Approach

@inproceedings{Heger2013HadoopPT,
  title={Hadoop Performance Tuning - A Pragmatic \& Iterative Approach},
  author={Dominique A. Heger},
  year={2013}
}
Hadoop is a Java-based distributed computing framework designed to support applications implemented via the MapReduce programming model. In general, workload-dependent Hadoop performance optimization efforts have to focus on three major categories: the system hardware (HW), the system software (SW), and the configuration and tuning/optimization of the Hadoop infrastructure components. From a systems HW perspective, it is paramount to balance the appropriate HW components in regards to… 
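As an illustration of the configuration-and-tuning category, the knobs that such an iterative tuning pass typically sweeps live in mapred-site.xml. The fragment below sketches a few commonly adjusted MapReduce 2.x properties; the specific values are assumptions for a mid-sized worker node, not settings taken from the paper:

```xml
<!-- mapred-site.xml: illustrative starting points for a workload-dependent
     tuning pass (example values, not recommendations from the paper) -->
<configuration>
  <!-- Map-side sort buffer: a larger buffer reduces spills to disk -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>256</value>
  </property>
  <!-- Container memory for map and reduce tasks -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <!-- Compress intermediate map output: trades CPU for network/disk I/O -->
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <!-- Parallel fetch threads used by reducers during the shuffle phase -->
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
    <value>10</value>
  </property>
</configuration>
```

Each parameter interacts with the HW balance discussed above (for instance, sort-buffer size against available RAM), which is why a measured, iterative approach is preferable to one-shot tuning.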

Figures and Tables from this paper

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters
This work attempts to analyze the effect of various configuration parameters on Hadoop Map-Reduce performance under various conditions, and suggests an optimal value in each case to achieve maximum throughput.
Workload Dependent Hadoop MapReduce Application Performance Modeling
In any distributed computing environment, performance optimization, job runtime prediction, or capacity and scalability quantification studies are considered rather complex and time-consuming…
Machine Learning-Based Configuration Parameter Tuning on Hadoop System
This paper focuses on optimizing Hadoop MapReduce job performance by tuning configuration parameters, and proposes an analytical method that helps system administrators choose approximately optimal configuration parameters based on the characteristics of each application.
ALOJA: A Framework for Benchmarking and Predictive Analytics in Hadoop Deployments
This article presents the ALOJA project and its analytics tools, which leverage machine learning to interpret big-data benchmark performance and tuning data, and provides an automated system that enables knowledge discovery by modeling environments from observed executions.
An Empirical Performance Analysis on Hadoop via Optimizing the Network Heartbeat Period
This paper improves Hadoop's I/O performance, as well as application performance, by up to 13 percent compared to the default configuration, and offers a guideline that predicts the performance, costs, and limitations of the total system by controlling the heartbeat period using simple equations.
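For context, the heartbeat period that this cited work varies corresponds, in stock Hadoop 2.x deployments, to the DataNode heartbeat setting in hdfs-site.xml. The fragment below is a sketch assuming those property names, with the stock 3-second default shown:

```xml
<!-- hdfs-site.xml: DataNode-to-NameNode heartbeat period, in seconds -->
<configuration>
  <property>
    <name>dfs.heartbeat.interval</name>
    <!-- 3 is the shipped default; the cited paper studies varying this period -->
    <value>3</value>
  </property>
</configuration>
```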
Evaluating the impact of SSDs and InfiniBand in Hadoop cluster performance and costs
Results show that, as expected, both technologies can speed up Big Data processing; however, contrary to common perception, SSDs and InfiniBand can actually improve the cost-effectiveness of even small clusters.
Modelo para estimar performance de um Cluster Hadoop [A model for estimating the performance of a Hadoop cluster]
The simplicity and lightness of the model allow the solution to be adopted as a facilitator for overcoming the challenges presented by Big Data, making Hadoop usable even by users with little IT experience.
HCEm model and a comparative workload analysis of Hadoop cluster
A comparative study of HCEm using similar applications and workloads in two production Hadoop clusters, the Amazon Elastic MapReduce and a private cloud in a large financial company, is presented to evaluate the performance of the model in real and intensive processing environments.
Hadoop MapReduce Configuration Parameters and System Performance: A Systematic Review
A systematic review identifying current research papers that address the correlation between Hadoop configuration settings and performance; 743 papers were identified across 5 searched databases.
How Much Solid State Drive Can Improve the Performance of Hadoop Cluster? Performance evaluation of Hadoop on SSD and HDD
It is shown that the external sorting algorithm in Hadoop (MapReduce) run with SSDs can outperform the same algorithm run with hard disks, and that power consumption can be drastically reduced when SSDs are used.
...

References

SHOWING 1-10 OF 21 REFERENCES
Optimizing Hadoop* Deployments
  • Computer Science
  • 2010
This paper provides guidance, based on extensive lab testing conducted with Hadoop* at Intel, to organizations as they make key choices in the planning stages of Hadoop deployments, with best practices for establishing server hardware specifications, helping architects choose optimal combinations of components.
Hadoop Design, Architecture & MapReduce Performance
Any distributed IT environment faces several similar challenges; an appropriate redundancy at the HW and SW levels has to be designed into the solution so that the environment meets its availability, reliability, maintainability, and performance goals and objectives.
Hadoop: The Definitive Guide
This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Improving MapReduce Performance in Heterogeneous Environments
This paper presents a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity and can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
The Google file system
This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Efficient parallel set-similarity joins using MapReduce
This paper proposes a 3-stage approach for end-to-end set-similarity joins in parallel using the popular MapReduce framework, and reports results from extensive experiments on real datasets to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
The Hadoop Distributed File System
The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.
MapReduce: simplified data processing on large clusters
This paper explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Optimizing joins in a map-reduce environment
The problem of optimizing the shares, given a fixed number of Reduce processes, is studied, and an algorithm is given for detecting and fixing problems where an attribute is "mistakenly" included in the map-key.
...