• Corpus ID: 212455695

An Efficient Approach to Optimize the Performance of Massive Small Files in Hadoop MapReduce Framework

Guru Prasad, Swathi C. Prabhu
Hadoop, the most popular open-source distributed computing framework, was designed by Doug Cutting and his team; it uses thousands of nodes to process and analyze huge amounts of data known as Big Data. Its major core components are HDFS (Hadoop Distributed File System) and MapReduce. The framework is powerful for storing, managing, and processing Big Data applications, but it suffers from stability and performance issues when handling small file… 
1 Citation

Performance Analysis of ECG Big Data using Apache Hive and Apache Pig
Results show that Apache Pig is more efficient and systematic, providing quick results in less time compared to Apache Hive.


A novel approach to improve the performance of Hadoop in handling of small files
This research work introduces HDFS, the small-file problem, and existing ways to deal with these problems, along with a proposed approach to handle small files; in the proposed approach, merging of small files is done using the MapReduce programming model on Hadoop.
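The merge-based idea described above is common across these papers: many small files are packed into one large container, and an index of (offset, length) pairs lets each original file be read back without scanning. A minimal sketch of that technique, with hypothetical function names and an in-memory container standing in for an HDFS block:

```python
import io

def merge_small_files(files: dict) -> tuple:
    """Concatenate small files into one blob; return (blob, index).

    The index maps each file name to its (offset, length) inside the
    blob, mirroring the index that merge-based approaches keep so the
    NameNode only tracks one large file instead of many small ones.
    """
    buf = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))  # record offset and length
        buf.write(data)
    return buf.getvalue(), index

def read_small_file(blob: bytes, index: dict, name: str) -> bytes:
    """Retrieve one original file by slicing the merged blob."""
    offset, length = index[name]
    return blob[offset:offset + length]
```

This is an illustration of the general merge-and-index strategy, not the specific implementation of any paper listed here.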
Improving Hadoop Performance in Handling Small Files
A solution which is expected to combine their merits while ensuring better performance of Hadoop is proposed, introduced as a solution for the small-files problem from Hadoop version 0.18.0 onwards.
Improving performance of small-file accessing in Hadoop
This paper proposes a mechanism based on Hadoop Archive (HAR) to improve the memory utilization for metadata and enhance the efficiency of accessing small files in HDFS, and extends HAR capabilities to allow additional files to be inserted into the existing archive files.
SFMapReduce: An optimized MapReduce framework for Small Files
This work proposes an optimized MapReduce framework for small files, SFMapReduce, and presents two techniques, Small File Layout (SFLayout) and customized MapReduce (CMR), used to solve the memory problem and improve I/O performance in HDFS.
Improving metadata management for small files in HDFS
This work proposes a mechanism to store small files in HDFS efficiently and improve the space utilization for metadata, and provides for new job functionality to allow for in-job archival of directories and files so that running MapReduce programs may complete without being killed by the JobTracker due to quota policies.
A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: A Case Study by PowerPoint Files
The experimental results indicate that the proposed approach is able to effectively mitigate the load of NameNode and to improve the efficiency of storing and accessing massive small files on HDFS.
A novel indexing scheme for efficient handling of small files in Hadoop Distributed File System
The experimental results indicate that EHDFS is able to reduce the metadata footprint on NameNode's main memory by 16% and also improve the efficiency of storing and accessing large number of small files.
Efficient prefetching technique for storage of heterogeneous small files in Hadoop Distributed File System Federation
This paper develops an efficient approach to handle files from heterogeneous users, devises a prefetching algorithm based on file access patterns, and provides options to modify and delete the files stored by users in Federated HDFS.
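Access-pattern-based prefetching, as the entry above describes, loads files into a cache before they are requested by learning which files tend to be read after which. A simplified sketch under assumed semantics (the class and its co-access tracking are illustrative, not the paper's algorithm):

```python
from collections import defaultdict

class PrefetchingCache:
    """Toy cache that prefetches files observed to follow the current one."""

    def __init__(self, store: dict):
        self.store = store                # backing store standing in for HDFS
        self.cache = {}                   # prefetched file contents
        self.follows = defaultdict(set)   # file name -> files seen right after it
        self.last = None                  # previously accessed file

    def read(self, name: str):
        # Serve from the prefetch cache when possible.
        if name in self.cache:
            data = self.cache.pop(name)
        else:
            data = self.store[name]
        # Learn the access pattern: `name` followed `self.last`.
        if self.last is not None:
            self.follows[self.last].add(name)
        self.last = name
        # Prefetch files that historically follow this one.
        for nxt in self.follows[name]:
            self.cache[nxt] = self.store[nxt]
        return data
```

A real prefetcher would bound the cache and weight co-access frequencies; this sketch only shows the pattern-then-prefetch loop.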
Improving the Efficiency of Storing for Small Files in HDFS
A novel processing strategy for small files is presented that performs efficient file merging, builds a file index, and uses a boundary file-block filling mechanism to accomplish file separation and retrieval.
Small files storing and computing optimization in Hadoop parallel rendering
Experimental results show that the proposed method significantly reduces the number of RIB files and render tasks, and improves the storage efficiency and computing efficiency of RIB files.