An Improved Approach for Analysis of Hadoop Data for All Files

@article{Jain2017AnIA,
  title={An Improved Approach for Analysis of Hadoop Data for All Files},
  author={Heena Jain and Ajay Goyal},
  journal={International Journal of Computer Applications},
  year={2017},
  volume={157},
  pages={15-20}
}
  • Heena Jain, Ajay Goyal
  • Published 17 January 2017
  • Computer Science
  • International Journal of Computer Applications
In this paper, an efficient framework is implemented on the Hadoop platform for almost all types of files. The proposed methodology is based on several algorithms run on Hadoop, such as Scan, Read, and Sort. Workloads of both small and large size, such as Facebook, co-author, and Twitter datasets, are used to analyze these algorithms. The experimental results show the performance of the proposed methodology, which provides efficient running time… 
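As an illustration only, the sketch below shows a minimal sort-style MapReduce job of the kind such Scan/Read/Sort workloads typically exercise, written against the standard Hadoop Java API (org.apache.hadoop.mapreduce). The class names and paths are hypothetical; this is not the paper's own implementation.

```java
// Illustrative sketch: a minimal "sort" workload on Hadoop MapReduce.
// Each input line is emitted as a key; the shuffle phase sorts keys for the reducer.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineSort {

  // Emits each input line as a key with an empty value.
  public static class LineMapper extends Mapper<Object, Text, Text, NullWritable> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, NullWritable.get());
    }
  }

  // Identity reducer: writes the already-sorted keys back out.
  public static class LineReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      context.write(key, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "line sort");
    job.setJarByClass(LineSort.class);
    job.setMapperClass(LineMapper.class);
    job.setReducerClass(LineReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a Twitter edge list
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With a single reducer, the shuffle yields a totally ordered output; with several reducers, each output file is sorted independently unless a total-order partitioner is used.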
Optimization of hadoop small file storage using priority model
  • V. Nivedita, J. Geetha
  • Computer Science
    2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT)
  • 2017
TLDR
The different techniques developed to handle the storage of small files are discussed, and a new method to solve the storage issue in Hadoop is proposed that uses the priority of a file as a distinguishing factor, which helps reduce the memory usage of the NameNode.
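For context, a common baseline technique for the small-file problem (independent of the priority model proposed in this work) is to pack many small files into a single SequenceFile so the NameNode tracks one large file instead of thousands of small ones. The sketch below assumes the standard org.apache.hadoop.io.SequenceFile API; the paths are placeholders and this is not the cited paper's method.

```java
// Illustrative sketch: pack a directory of small files into one SequenceFile
// (key = original file name, value = raw file bytes) to reduce NameNode metadata.
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory full of small files (hypothetical)
    Path packed = new Path(args[1]);     // single output SequenceFile

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(packed),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isFile()) {
          byte[] contents = new byte[(int) status.getLen()];
          try (InputStream in = fs.open(status.getPath())) {
            IOUtils.readFully(in, contents, 0, contents.length);
          }
          // Key: original file name; value: the file's bytes.
          writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(contents));
        }
      }
    }
  }
}
```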
Performance Evaluation of Apache Hadoop Benchmarks under a Dynamic Checkpointing Mechanism
TLDR
This work presents how the Apache Hadoop framework implements the checkpoint and recovery technique to provide fault tolerance on its distributed file system (the Hadoop Distributed File System).
Experimentation and Analysis of Dynamic Checkpoint on Apache Hadoop with Failure Scenarios
TLDR
This work uses a dynamic configuration mechanism for checkpointing on Hadoop and evaluates its performance in scenarios with induced faults on the master element of HDFS.
Employment of Optimal Approximations on Apache Hadoop Checkpoint Technique for Performance Improvements
TLDR
Improvements to the DCA are presented through configuring the Hadoop checkpoint period in real time, based on optimal period approximations already endorsed by the literature; the evaluation results show that an adaptive configuration of checkpoint periods reduces the time wasted by failures in the NameNode and improves Hadoop performance.
Validation of a dynamic checkpoint mechanism for Apache Hadoop with failure scenarios
TLDR
This paper proposes a dynamic solution for checkpoint attribute configuration on the Hadoop Distributed File System (HDFS), with the goal of making it adaptable to the system usage context.
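The checkpoint-related works above all revolve around tuning the HDFS checkpoint interval. For reference, the relevant knobs are the standard configuration keys dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns, normally set in hdfs-site.xml and read by the node that performs the checkpoint; the snippet below only shows those keys with placeholder values, not the dynamic or optimal settings derived in these papers.

```java
// Sketch of the HDFS checkpoint configuration keys that the works above tune.
// Values are placeholders; in practice these keys live in hdfs-site.xml and are
// read by the Secondary/Standby NameNode that performs the checkpoint.
import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Checkpoint at least every 1800 seconds (default is 3600)...
    conf.setLong("dfs.namenode.checkpoint.period", 1800L);

    // ...or after 500,000 uncheckpointed transactions, whichever comes first
    // (default is 1,000,000).
    conf.setLong("dfs.namenode.checkpoint.txns", 500_000L);

    System.out.println("checkpoint period (s): "
        + conf.getLong("dfs.namenode.checkpoint.period", 3600L));
    System.out.println("checkpoint txns: "
        + conf.getLong("dfs.namenode.checkpoint.txns", 1_000_000L));
  }
}
```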
Política Customizada de Balanceamento de Réplicas para o HDFS Balancer do Apache Hadoop (Customized Replica Balancing Policy for the Apache Hadoop HDFS Balancer)
TLDR
A customized balancing policy for the HDFS Balancer is proposed, based on a system of priorities that can be adapted and configured according to usage demands, thus making the balancing more flexible.

References

The Hadoop Distributed File System
TLDR
The architecture of HDFS is described, and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported.
Fault Tolerance in Hadoop for Work Migration
TLDR
This survey paper focuses on HDFS and how it was implemented to be highly fault tolerant, since fault tolerance is an essential part of modern distributed systems.
HADOOP SKELETON & FAULT TOLERANCE IN HADOOP CLUSTERS
TLDR
The framework of Hadoop is described, along with how fault tolerance is achieved by means of data duplication, a mechanism that allows the system to continue functioning correctly even after some components stop working properly.
SFMapReduce: An optimized MapReduce framework for Small Files
TLDR
This work proposes an optimized MapReduce framework for small files, SFMapReduce, and presents two techniques, Small File Layout (SFLayout) and customized Map Reduce (CMR), used to solve the memory problem and improve I/O performance in HDFS.
MapReduce: simplified data processing on large clusters
TLDR
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds
  • F. Tian, Keke Chen
  • Computer Science
    2011 IEEE 4th International Conference on Cloud Computing
  • 2011
TLDR
This paper builds a cost function that explicitly models the relationship between the amount of input data, the available system resources, and the complexity of the Reduce function for the target MapReduce job, and shows that this cost model performs well on the tested MapReduce programs.
Oivos: Simple and Efficient Distributed Data Processing
TLDR
Oivos, a high-level declarative programming model, and its underlying runtime are introduced; it is shown how Oivos programs may specify computations that span multiple heterogeneous and interdependent data sets, how the programs are compiled and optimized, and how the runtime orchestrates and monitors their distributed execution.
Checkpoint-based fault-tolerant infrastructure for virtualized service providers
TLDR
A smart checkpoint infrastructure for virtualized service providers is proposed that allows task execution to resume faster after a node crash and increases the fault tolerance of the system, since checkpoints are distributed and replicated across all the nodes of the provider.
Large-scale Image Processing Using MapReduce
TLDR
This thesis looks at the issues of processing two kinds of data with MapReduce: large data sets of regular images and single large images. It classifies image processing algorithms as iterative/non-iterative and local/non-local, and presents a general analysis of why different combinations of algorithms and data might be easier or harder to adapt for distributed processing with MapReduce.
XORing Elephants: Novel Erasure Codes for Big Data
TLDR
A novel family of erasure codes that are efficiently repairable and offer higher reliability than Reed-Solomon codes is presented; the reliability provided is orders of magnitude higher than that of replication.
...