An Efficient Approach for Storing and Accessing Small Files with Big Data Technology

  • Bharti Gupta, Rajender Nath, Girdhar Gopal, Kartik
  • International Journal of Computer Applications
Hadoop is an open-source Apache project and a software framework for the distributed processing of large datasets across large clusters of commodity hardware. Large datasets means terabytes or petabytes of data, whereas large clusters means hundreds or thousands of nodes. Hadoop follows a master-slave architecture, with one master node and up to thousands of slave nodes. The NameNode acts as the master node and stores all the file metadata, while the various DataNodes are the slave nodes…
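
For readers less familiar with this architecture, the following is a minimal sketch (not taken from the paper above) of how a client touches both roles through the standard HDFS Java API: creating a file registers metadata with the NameNode, while the payload bytes are stored on DataNodes. The cluster address and path are assumptions.

```java
// Minimal HDFS client sketch (illustrative only): every small file written this
// way costs the NameNode one metadata entry, which is why millions of small
// files exhaust its memory long before the DataNodes run out of disk.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");       // assumed cluster address
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/demo/small-file.txt");               // assumed path
        try (FSDataOutputStream out = fs.create(p, true)) {      // NameNode records the metadata
            out.writeBytes("a few bytes of payload\n");          // DataNodes store the block
        }
        try (FSDataInputStream in = fs.open(p)) {
            IOUtils.copyBytes(in, System.out, 4096, false);      // read the file back
        }
        fs.close();
    }
}
```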

A Review of Various Optimization Schemes of Small Files Storage on Hadoop

The basic architecture of the Hadoop system is introduced, the problems generated when Hadoop handles a large number of small files are analyzed and summarized, and the necessity of an optimization scheme for small-file storage based on Hadoop is shown.

Performance Analysis of Small Files in HDFS using Clustering Small Files based on Centroid Algorithm

  • R. Rathidevi, R. Parameswari
  • Computer Science
    2020 Fourth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC)
  • 2020
The Clustering Small Files based on Centroid (CSFC) approach is used to place related files in the same cluster so that they can be processed as a large file in Hadoop.

Available techniques in hadoop small file issue

One of Hadoop’s limitations, called “big data in small files”, which occurs when a massive number of small files is pushed into a Hadoop cluster and can drive the cluster to a complete shutdown, is highlighted.

SFSAN Approach for Solving the Problem of Small Files in Hadoop

This paper proposes an enhancement of the sequence-file approach, called the Small Files Search and Aggregation Node (SFSAN) approach, which improves Hadoop performance by overcoming some of the limitations of the sequence-file approach while keeping its advantages.
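
As background for the sequence-file idea that SFSAN extends, here is a hedged sketch (not the authors' implementation) that packs a local directory of small files into a single SequenceFile keyed by file name, so the NameNode tracks one large file instead of many; the paths are assumptions.

```java
// Illustrative SequenceFile packing: many small local files become records of
// one HDFS file, keyed by their original names.
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path target = new Path("/packed/smallfiles.seq");            // assumed output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // args[0]: local directory of small files (assumed to exist)
            for (File f : new File(args[0]).listFiles()) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```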

Small files problem in Hadoop -A Survey

This paper provides a comparative study of various methods for handling small files in the Hadoop system.

CSFC: A New Centroid Based Clustering Method to Improve the Efficiency of Storing and Accessing Small Files in Hadoop

In the proposed CSFC (Clustering Small Files based on Centroid) system, the clustering technique is used without specifying the number of clusters in advance, because if the number of clusters is fixed beforehand, all files are forced into that limited number of clusters.
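
The summary above does not give the algorithm's details, so the following is only a hypothetical illustration of threshold-based centroid clustering that creates clusters on demand instead of fixing their number in advance; the feature vectors, distance metric, and threshold are all assumptions, not the CSFC algorithm itself.

```java
// Hypothetical sketch: a file (reduced to a feature vector) joins the nearest
// existing centroid if it lies within a distance threshold, otherwise it seeds
// a new cluster, so the number of clusters grows with the data.
import java.util.ArrayList;
import java.util.List;

public class ThresholdClustering {
    static final double THRESHOLD = 0.5;              // assumed similarity cutoff
    static List<double[]> centroids = new ArrayList<>();
    static List<Integer> sizes = new ArrayList<>();

    static int assign(double[] fileVector) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = distance(centroids.get(i), fileVector);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        if (best == -1 || bestDist > THRESHOLD) {      // no close centroid: new cluster
            centroids.add(fileVector.clone());
            sizes.add(1);
            return centroids.size() - 1;
        }
        double[] c = centroids.get(best);              // update the running mean
        int n = sizes.get(best);
        for (int j = 0; j < c.length; j++) c[j] = (c[j] * n + fileVector[j]) / (n + 1);
        sizes.set(best, n + 1);
        return best;
    }

    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```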

Small Files Consolidation Technique in Hadoop Cluster

The proposed Small File Consolidation (SFC) technique aims to overcome some of the current performance challenges of a Hadoop cluster and improves query execution time by generating the result set quickly, which leads to more effective management of cluster usage.

An Approach for Effectively Handling Small-Size Image Files in Hadoop

The approach used in this paper is shown to be more efficient than the solution provided by HIPI (Hadoop Image Processing Interface), and small-size image files form an ideal application domain for evaluating solutions to the small-file handling problem in Hadoop.

Application of Computer Big Data in Internet Learning

It is believed that Internet learning is the major trend in future teaching reform and development, and that more attention should be paid to students’ experience and to the optimization and upgrading of related technologies in the process of reform and development.

Resolving data interoperability in ubiquitous health profile using semi-structured storage and processing

The Ubiquitous Health Profile (UHPr) enables a semantic solution to the data interoperability problem in the domain of healthcare.

THE optimization of HDFS based on small files

  • Liu Jiang, Bing Li, Meina Song
  • Computer Science
    2010 3rd IEEE International Conference on Broadband Network and Multimedia Technology (IC-BNMT)
  • 2010
This article optimizes the HDFS I/O features for small files; the basic idea is to let one block store many small files and to let the DataNode keep some metadata of the small files in its memory.

A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: A Case Study by PowerPoint Files

The experimental results indicate that the proposed approach is able to effectively mitigate the load of NameNode and to improve the efficiency of storing and accessing massive small files on HDFS.

Improving metadata management for small files in HDFS

This work proposes a mechanism to store small files in HDFS efficiently and to improve space utilization for metadata, and provides new job functionality that allows in-job archival of directories and files so that running MapReduce programs can complete without being killed by the JobTracker due to quota policies.

Hadoop: The Definitive Guide

This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Comparing Hadoop and Fat-Btree Based Access Method for Small File I/O Applications

This paper compares the Fat-Btree based data access method, which eliminates the central node in a cluster, with Hadoop, and shows their differing performance across file I/O applications.

The Hadoop Distributed File System

The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.

Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS

This paper proposes an approach to optimize the I/O performance of small files on HDFS by combining small files into large ones to reduce the file count and by building an index for each file.
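
Below is a hedged sketch of the general merge-and-index idea described above (not the WebGIS system itself): small files are concatenated into one HDFS file while an index records each file's offset and length, so any original file can later be recovered with a positional read. The file names and index structure are assumptions.

```java
// Illustrative merge-and-index: concatenate small files into one HDFS file and
// keep an in-memory index of (offset, length) per original file name.
import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeWithIndex {
    static class Entry {
        final long offset; final int length;
        Entry(long offset, int length) { this.offset = offset; this.length = length; }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path merged = new Path("/merged/tiles.bin");              // assumed output path
        Map<String, Entry> index = new HashMap<>();

        try (FSDataOutputStream out = fs.create(merged, true)) {
            for (File f : new File(args[0]).listFiles()) {        // local small files
                byte[] bytes = Files.readAllBytes(f.toPath());
                index.put(f.getName(), new Entry(out.getPos(), bytes.length));
                out.write(bytes);
            }
        }

        // Random access to one original file via the index (hypothetical file name).
        Entry e = index.get("tile_0042.png");
        try (FSDataInputStream in = fs.open(merged)) {
            byte[] buf = new byte[e.length];
            in.readFully(e.offset, buf);                           // positional read
        }
        fs.close();
    }
}
```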

Performance analysis of Hadoop for handling small files in single node

Through experiments with some typical file sets on a single node, Hadoop’s performance on small files under different FileInputFormat implementations is compared, and the performance differences are explained by Hadoop’s own execution principles.
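
To make this kind of comparison concrete, here is a hedged sketch (not the paper's benchmark) of a trivial MapReduce job that can be run once with the default TextInputFormat (one split per small file) and once, as configured below, with CombineTextInputFormat, which packs many small files into each input split; the job name, split size, and mapper are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileBenchmark {

    // Trivial mapper: emits one count per input line, just to exercise the input format.
    public static class LineCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("lines"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "small-file-benchmark");
        job.setJarByClass(SmallFileBenchmark.class);
        job.setMapperClass(LineCountMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Variant under test: pack many small files into each split instead of
        // the default one-split-per-file behaviour of TextInputFormat.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // ~128 MB splits

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```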

A digital library architecture supporting massive small files and efficient replica maintenance

A service infrastructure based on a distributed file system for massive storage in a digital library is presented, and a novel dynamic replica-number adjustment scheme is proposed to ensure maximal availability and reliability within limited storage space.
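
The replica-adjustment scheme itself is not described in this summary, but for reference, HDFS already exposes the per-file knob such a policy would drive; the sketch below is an assumption-laden illustration of changing a file's replication factor at runtime, not the paper's scheme.

```java
// Illustrative only: raise or lower a file's replication factor; the NameNode
// then schedules block re-replication or deletion in the background.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplicas {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path hot = new Path(args[0]);                  // a frequently accessed file (assumed)
        short desired = Short.parseShort(args[1]);     // e.g. 5 for hot data, 2 for cold
        boolean ok = fs.setReplication(hot, desired);
        System.out.println("replication change accepted: " + ok);
        fs.close();
    }
}
```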