Auto Tuning of Hadoop and Spark parameters

@article{Patanshetti2021AutoTO,
  title={Auto Tuning of Hadoop and Spark parameters},
  author={Tanuja Patanshetti and Ashish Anil Pawar and Disha Patel and Sanket Thakare},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.02604}
}
Data on the order of terabytes, petabytes, or beyond is known as Big Data. Such data cannot be processed with traditional database software, hence the need for Big Data Platforms. A Big Data Platform combines the capabilities and features of various big data applications and utilities into a single solution; it helps to develop, deploy, and manage the big data environment. Hadoop and Spark are two open-source Big Data Platforms provided by Apache…
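
As a concrete illustration of the kind of parameters such platforms expose, here is a minimal PySpark sketch that sets a few commonly tuned Spark configuration values before running a job; the values shown are illustrative defaults, not the tuned settings from the paper.

```python
# Minimal PySpark sketch: setting tunable configuration parameters.
# The values below are illustrative, not the paper's tuned settings.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("wordcount-tuning-demo")
    .set("spark.executor.memory", "4g")         # executor heap size
    .set("spark.executor.cores", "2")           # cores per executor
    .set("spark.sql.shuffle.partitions", "64")  # shuffle parallelism
    .set("spark.serializer",
         "org.apache.spark.serializer.KryoSerializer")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
counts = (
    spark.sparkContext.parallelize(["a b a", "b c"])
    .flatMap(str.split)
    .map(lambda w: (w, 1))
    .reduceByKey(lambda x, y: x + y)
    .collect()
)
print(counts)
spark.stop()
```
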
1 Citation

A Novel Reinforcement Learning Approach for Spark Configuration Parameter Optimization

TLDR
A reinforcement-learning-based Spark configuration parameter optimizer is designed and implemented that can efficiently find better configuration parameters and improve the performance of various Spark applications.
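
The optimizer's exact algorithm is not reproduced here; the sketch below only illustrates the general idea, framing configuration tuning as an epsilon-greedy bandit over a discretized parameter grid, with a synthetic runtime function standing in for real job executions.

```python
# Generic sketch of RL-style configuration search (not the paper's
# algorithm): epsilon-greedy selection over a discretized Spark parameter
# grid, with a synthetic noisy runtime model in place of real job runs.
import itertools
import random

executor_memory_gb = [2, 4, 8]
shuffle_partitions = [32, 64, 128]
arms = list(itertools.product(executor_memory_gb, shuffle_partitions))

def run_job(mem, parts):
    """Synthetic, noisy 'runtime' in seconds; lower is better."""
    return 100 / mem + abs(parts - 64) * 0.2 + random.gauss(0, 1)

q = {a: 0.0 for a in arms}   # running runtime estimate per configuration
n = {a: 0 for a in arms}
epsilon = 0.2

for step in range(300):
    if random.random() < epsilon:
        arm = random.choice(arms)        # explore a random configuration
    else:
        arm = min(q, key=q.get)          # exploit the best estimate
    cost = run_job(*arm)
    n[arm] += 1
    q[arm] += (cost - q[arm]) / n[arm]   # incremental mean update

best = min(q, key=q.get)
print("best config:", best, "estimated runtime:", round(q[best], 2))
```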

References

Showing 1–10 of 12 references

SMBSP: A Self-Tuning Approach using Machine Learning to Improve Performance of Spark in Big Data Processing

TLDR
This paper proposes SMBSP, an effective self-tuning approach based on an Artificial Neural Network (ANN) that avoids the drawbacks of manual parameter tuning in the Hadoop-Spark system; SMBSP is found to speed up the Spark system by 35% compared with the default parameter configuration.
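
A minimal sketch of the ANN-based idea, assuming scikit-learn's MLPRegressor as the network and synthetic (configuration, runtime) pairs as the execution history; SMBSP's actual features and architecture may differ.

```python
# Sketch of ANN-based self-tuning in the spirit of SMBSP (details assumed):
# learn runtime = f(configuration) with an MLP, then pick the predicted best.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic history: columns = (executor memory GB, shuffle partitions).
X = rng.uniform([1, 16], [16, 256], size=(200, 2))
y = 100 / X[:, 0] + np.abs(X[:, 1] - 96) * 0.1 + rng.normal(0, 1, 200)

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                     random_state=0).fit(X, y)

# Evaluate the model on a dense candidate grid and pick the minimum.
cand = np.array([[m, p] for m in range(1, 17) for p in range(16, 257, 16)])
best = cand[model.predict(cand).argmin()]
print("predicted best config (mem GB, partitions):", best)
```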

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

TLDR
This work analyzes the effect of various configuration parameters on Hadoop MapReduce performance under various conditions and suggests an optimal value in each case to achieve maximum throughput.
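
A minimal sketch of this sweep-style analysis, assuming a job that uses Hadoop's ToolRunner so that generic -D property overrides apply; the jar name, input/output paths, and parameter grid below are placeholders, not values from the paper.

```python
# Sketch of a one-parameter-at-a-time sweep over Hadoop configuration,
# using Hadoop's generic "-D property=value" overrides. The jar name,
# paths, and parameter grid are placeholders.
import shlex
import subprocess
import time

SWEEP = {
    "mapreduce.task.io.sort.mb": [100, 200, 400],
    "mapreduce.job.reduces": [4, 8, 16],
}

def run_once(prop, value):
    """Launch one job with a single overridden property and time it."""
    cmd = (f"hadoop jar app.jar WordCount "
           f"-D {prop}={value} /in /out-{prop}-{value}")
    t0 = time.time()
    subprocess.run(shlex.split(cmd), check=True)
    return time.time() - t0

results = {}
for prop, values in SWEEP.items():
    timings = {v: run_once(prop, v) for v in values}
    results[prop] = min(timings, key=timings.get)  # fastest value wins
print("suggested values:", results)
```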

Speed-up Extension to Hadoop System

TLDR
Apache Hadoop is an open-source implementation of Google's Map/Reduce framework; it enables data-intensive, distributed, and parallel applications by dividing a massive job into smaller tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel.
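
A toy, self-contained imitation of that split-map-reduce model (plain Python, not Hadoop code): the data is split into partitions, each map task counts words in its own partition in a separate process, and a reduce step merges the partial counts.

```python
# Toy imitation of the Map/Reduce model described above (not Hadoop code).
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_partition(lines):
    """Map task: count words within one partition."""
    return Counter(word for line in lines for word in line.split())

if __name__ == "__main__":
    data = ["a b a", "b c", "a c c", "b b a"]
    partitions = [data[i::2] for i in range(2)]       # 2 partitions
    with Pool(2) as pool:                             # parallel map tasks
        partials = pool.map(map_partition, partitions)
    total = reduce(lambda c1, c2: c1 + c2, partials)  # reduce phase
    print(dict(total))
```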

Towards Machine Learning-Based Auto-tuning of MapReduce

TLDR
This paper evaluates several machine learning models with diverse MapReduce applications and cluster configurations, shows that the support vector regression (SVR) model offers good accuracy while being computationally efficient, and proposes and discusses a complete and practical end-to-end auto-tuning flow.
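
A minimal sketch of the SVR idea, with synthetic features and data standing in for the paper's feature engineering: fit runtime as a function of configuration, then query the model for unseen configurations.

```python
# Sketch of SVR-based runtime prediction (features and data are synthetic;
# the paper's actual feature engineering is not reproduced here).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
# Features: (map tasks, io.sort.mb, reduce tasks) -- illustrative only.
X = rng.uniform([8, 100, 2], [64, 400, 32], size=(300, 3))
y = 500 / X[:, 0] + np.abs(X[:, 1] - 250) * 0.05 + 3 * X[:, 2] ** 0.5
y += rng.normal(0, 0.5, 300)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)
print("predicted runtime:", model.predict([[32, 250, 8]])[0])
```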

Otterman: A Novel Approach of Spark Auto-tuning by a Hybrid Strategy

TLDR
Otterman is a parameter-optimization approach based on the combination of the Simulated Annealing algorithm and the Least Squares method; it dynamically adjusts parameters according to job type to obtain an optimal configuration and improve performance.
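
A generic sketch of such a hybrid, assuming a quadratic least-squares performance model fitted to observed runs and a geometric cooling schedule; Otterman's actual model form and schedule may differ.

```python
# Generic sketch of the hybrid strategy: a least-squares quadratic model
# fitted to observed (config, runtime) pairs, then searched with simulated
# annealing. Model form and cooling schedule are assumptions.
import math
import random
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(1, 16, size=(100, 1))         # e.g. executor memory (GB)
y = 100 / X[:, 0] + rng.normal(0, 1, 100)     # observed runtimes

# Least-squares fit of runtime ~ a + b*x + c*x^2.
A = np.column_stack([np.ones_like(X[:, 0]), X[:, 0], X[:, 0] ** 2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
surrogate = lambda x: coef[0] + coef[1] * x + coef[2] * x * x

# Simulated annealing over the fitted surrogate.
x = 8.0
best_x, best_f = x, surrogate(x)
temp = 5.0
for _ in range(2000):
    cand = min(16.0, max(1.0, x + random.gauss(0, 1)))
    delta = surrogate(cand) - surrogate(x)
    if delta < 0 or random.random() < math.exp(-delta / temp):
        x = cand                               # accept the move
        if surrogate(x) < best_f:
            best_x, best_f = x, surrogate(x)
    temp *= 0.995                              # geometric cooling
print(f"best memory ~ {best_x:.1f} GB, predicted runtime {best_f:.2f}")
```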

JellyFish: Online Performance Tuning with Adaptive Configuration and Elastic Container in Hadoop Yarn

  • Xiaoan Ding, Yi Liu, D. Qian
  • Computer Science
    2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)
  • 2015
TLDR
Experimental results show that JellyFish can improve the performance of MapReduce jobs by an average of 24% for jobs run for the first time, and by an average of 65% for jobs run multiple times, compared to default YARN.

Auto-Tuning Spark Configurations Based on Neural Network

TLDR
A neural-network-based configuration tuning approach is proposed: the network is trained to predict whether each configuration value should be increased or decreased, which determines the next search space; the approach outperforms related approaches, reaching an optimal configuration in less search time.
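
A generic sketch of the increase/decrease idea, with a synthetic objective and a scikit-learn classifier standing in for the paper's network; the feature set, step size, and stopping policy here are all assumptions.

```python
# Generic sketch of direction-prediction tuning (the paper's network and
# features are assumptions): a classifier learns whether increasing a
# parameter from the current point reduces runtime; the search then
# repeatedly steps in the predicted direction.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

def runtime(p):                      # synthetic objective, minimum at 96
    return np.abs(p - 96) * 0.1 + 2

# Training pairs: (partitions, does increasing by 8 help?)
P = rng.uniform(16, 256, size=500)
labels = (runtime(P + 8) < runtime(P)).astype(int)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(P.reshape(-1, 1), labels)

p = 32.0                             # start far from the optimum
for _ in range(30):                  # step in the predicted direction
    step = 8 if clf.predict([[p]])[0] == 1 else -8
    p = float(np.clip(p + step, 16, 256))
print("tuned partitions:", p, "runtime:", round(runtime(p), 2))
```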

Towards Automatic Tuning of Apache Spark Configuration

TLDR
This paper investigates machine-learning-based approaches to construct application-specific performance influence models and uses them to tune the performance of specific applications running on the Apache Spark platform.
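
A minimal sketch of a performance influence model, assuming a random-forest regressor so that per-parameter influence can be read off the fitted feature importances; the paper's model family and training data are not specified here.

```python
# Sketch of an application-specific performance influence model (model
# family assumed: random forest; data synthetic). Feature importances
# indicate which parameters influence runtime most for this application.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
names = ["spark.executor.memory", "spark.executor.cores",
         "spark.sql.shuffle.partitions"]
X = rng.uniform([1, 1, 16], [16, 8, 256], size=(400, 3))
y = 80 / X[:, 0] + 5 / X[:, 1] + np.abs(X[:, 2] - 96) * 0.05
y += rng.normal(0, 0.5, 400)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(names, model.feature_importances_):
    print(f"{name}: influence {imp:.2f}")
```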

Random Search for Hyper-Parameter Optimization

TLDR
This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid, and shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
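
A self-contained sketch of the core comparison on a synthetic objective in which only one of two hyper-parameters matters: under the same trial budget, random search visits many distinct values of the important dimension, while the grid wastes trials repeating a few.

```python
# Sketch of the grid-vs-random comparison: with 16 trials, the grid covers
# only 4 distinct values of the important dimension x, while random search
# covers 16 distinct values of it.
import itertools
import random

random.seed(0)
f = lambda x, y: (x - 0.37) ** 2           # y is an unimportant dimension

budget = 16
grid_axis = [i / 3 for i in range(4)]      # 4 x 4 grid = 16 trials
grid_best = min(f(x, y) for x, y in itertools.product(grid_axis, grid_axis))
rand_best = min(f(random.random(), random.random()) for _ in range(budget))

print(f"grid best:   {grid_best:.4f}  (4 distinct x values)")
print(f"random best: {rand_best:.4f}  (16 distinct x values)")
```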

Bayesian Optimization in a Billion Dimensions via Random Embeddings

TLDR
Empirical results confirm that REMBO can effectively solve problems with billions of dimensions, provided the intrinsic dimensionality is low, and show that REMBO achieves state-of-the-art performance in optimizing the 47 discrete parameters of a popular mixed integer linear programming solver.
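
A minimal sketch of the random-embedding trick behind REMBO: search a random low-dimensional subspace z and evaluate the high-dimensional objective at x = clip(Az). For brevity, the low-dimensional inner optimizer here is plain random search rather than full Bayesian optimization.

```python
# Sketch of the random-embedding trick behind REMBO: candidates are drawn
# in a low-dimensional space and mapped up through a random matrix, so the
# search only ever operates in d dimensions even though the objective
# lives in D dimensions. The inner optimizer is simplified to random
# search; REMBO itself uses Bayesian optimization in the embedded space.
import numpy as np

rng = np.random.default_rng(5)
D, d = 1000, 2                       # ambient and intrinsic dimensions
idx = rng.choice(D, size=d, replace=False)

def f(x):                            # only d hidden coordinates matter
    return np.sum((x[idx] - 0.5) ** 2)

A = rng.normal(size=(D, d))          # random embedding matrix
best = np.inf
for _ in range(500):
    z = rng.uniform(-1, 1, size=d)   # low-dimensional candidate
    x = np.clip(A @ z, -1, 1)        # map up and project into the box
    best = min(best, f(x))
print("best value found in the embedded search:", round(best, 4))
```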