Auto Tuning of Hadoop and Spark parameters

@article{Patanshetti2021AutoTO,
  title={Auto Tuning of Hadoop and Spark parameters},
  author={Tanuja Patanshetti and Ashish Anil Pawar and Disha Patel and Sanket Thakare},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.02604}
}
Data on the order of terabytes, petabytes, or beyond is known as Big Data. Such data cannot be processed with traditional database software, hence the need for Big Data platforms. A Big Data platform combines the capabilities and features of various big data applications and utilities into a single solution, helping to develop, deploy, and manage the big data environment. Hadoop and Spark are two open-source Big Data platforms provided by Apache… 

References

SHOWING 1-10 OF 12 REFERENCES
SMBSP: A Self-Tuning Approach using Machine Learning to Improve Performance of Spark in Big Data Processing
TLDR
This paper proposes SMBSP, an effective self-tuning approach based on an Artificial Neural Network (ANN), to avoid the drawbacks of manual parameter tuning in the Hadoop-Spark system; SMBSP is found to speed up the Spark system by 35% compared with the default parameter configuration.
A Framework for Performance Analysis and Tuning in Hadoop Based Clusters
TLDR
This work attempts to analyze the effect of various configuration parameters on Hadoop Map-Reduce performance under various conditions, and suggests an optimal value in each case to achieve maximum throughput.
Otterman: A Novel Approach of Spark Auto-tuning by a Hybrid Strategy
TLDR
Otterman is a parameter-optimization approach based on a combination of the Simulated Annealing algorithm and the Least Squares method; it dynamically adjusts parameters according to job type to obtain an optimal configuration and improve performance.
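To make the simulated-annealing half of such a hybrid strategy concrete, here is a minimal sketch of annealing over a single illustrative Spark-style parameter (a shuffle partition count). The `runtime` cost function is entirely made up for the example; it is not Otterman's model, which fits costs with least squares.

```python
import math
import random

# Hypothetical cost function: estimated job runtime for one illustrative
# parameter (shuffle partition count). The formula is invented for this sketch;
# a real tuner would measure or model actual job runtimes.
def runtime(partitions):
    return abs(partitions - 200) / 50 + 1.0

def simulated_annealing(start=500, temp=10.0, cooling=0.95, steps=200, seed=0):
    rng = random.Random(seed)
    current, current_cost = start, runtime(start)
    best, best_cost = current, current_cost
    for _ in range(steps):
        # Propose a neighboring configuration.
        candidate = max(1, current + rng.randint(-50, 50))
        cand_cost = runtime(candidate)
        # Always accept improvements; accept worse moves with a
        # temperature-dependent (Boltzmann) probability.
        if cand_cost < current_cost or rng.random() < math.exp((current_cost - cand_cost) / temp):
            current, current_cost = candidate, cand_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        temp *= cooling  # gradually reduce willingness to accept worse moves
    return best, best_cost

best, best_cost = simulated_annealing()
print(best, best_cost)
```

The cooling schedule lets the search escape poor local minima early on, then settle near a good configuration as the temperature drops.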
JellyFish: Online Performance Tuning with Adaptive Configuration and Elastic Container in Hadoop Yarn
  • Xiaoan Ding, Yi Liu, D. Qian
  • Computer Science
    2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS)
  • 2015
TLDR
Experimental results show that JellyFish can improve the performance of MapReduce jobs by an average of 24% for jobs run for the first time, and by an average of 65% for jobs run multiple times, compared to default YARN.
Auto-Tuning Spark Configurations Based on Neural Network
TLDR
A neural-network-based configuration tuning approach is proposed: the network is trained to predict whether each configuration value should be increased or decreased, which determines the next search space; the approach outperforms related approaches, reaching an optimal configuration in less search time.
Towards Automatic Tuning of Apache Spark Configuration
TLDR
This paper investigates machine learning based approaches to construct application specific performance influence models, and uses them to tune the performance of specific applications running on Apache Spark platform.
Random Search for Hyper-Parameter Optimization
TLDR
This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid, and shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
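A minimal sketch of random search applied to configuration tuning of the kind this survey concerns: sample parameter settings at random and keep the best. The objective below is a made-up stand-in for measured job runtime, and the two parameter names are illustrative, not an actual Spark configuration model.

```python
import random

# Hypothetical objective: estimated job runtime (seconds) as a function of
# two illustrative Spark-style parameters. The formula is invented for this
# sketch; a real tuner would run the job and measure its runtime.
def estimated_runtime(executor_memory_gb, shuffle_partitions):
    # Penalize too little memory and partition counts far from a sweet spot.
    return (32 / executor_memory_gb) + abs(shuffle_partitions - 200) / 50

def random_search(trials=50, seed=42):
    rng = random.Random(seed)
    best_config, best_time = None, float("inf")
    for _ in range(trials):
        # Draw each parameter independently from its range.
        config = {
            "executor_memory_gb": rng.choice([2, 4, 8, 16, 32]),
            "shuffle_partitions": rng.randint(50, 800),
        }
        t = estimated_runtime(**config)
        if t < best_time:
            best_config, best_time = config, t
    return best_config, best_time

best_config, best_time = random_search()
print(best_config, best_time)
```

Unlike grid search, the number of trials is fixed regardless of how many parameters are tuned, which is the core efficiency argument of the paper.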
Bayesian Optimization in a Billion Dimensions via Random Embeddings
TLDR
Empirical results confirm that REMBO can effectively solve problems with billions of dimensions, provided the intrinsic dimensionality is low, and show that REMBO achieves state-of-the-art performance in optimizing the 47 discrete parameters of a popular mixed integer linear programming solver.
A Practical Guide to Support Vector Classification
TLDR
A simple procedure is proposed, which usually gives reasonable results and is suitable for beginners who are not familiar with SVM.
Speed-up Extension to Hadoop System
TLDR
Apache's Hadoop is an open-source implementation of Google's Map/Reduce framework; it enables data-intensive, distributed, and parallel applications by dividing a massive job into smaller tasks and massive data sets into smaller partitions, such that each task processes a different partition in parallel.
...