• Corpus ID: 16927085

Big Data Analytics: An Approach using Hadoop Distributed File System

P. Beaulah Soundarabai, S. Aravindh, J. Thriveni
Today's world is driven by growth and innovation for a better future, both of which depend on analyzing and harnessing vast quantities of data, commonly known as Big Data. Achieving results at such a scale can be challenging and painfully slow. This paper works towards an approach for effectively solving a large, computationally intensive problem by leveraging the capabilities of Hadoop and HBase. Here we demonstrate how to reduce and distribute the large problem across… 
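The paper's core idea of splitting one large problem into independent pieces and merging the partial results is the classic MapReduce pattern that Hadoop implements. As an illustrative sketch only (not code from the paper), the pattern can be simulated in plain Python with word counting standing in for the computationally intensive task:

```python
from collections import defaultdict

def map_phase(chunk):
    # Each mapper emits partial counts for its own slice of the input,
    # analogous to one Hadoop task processing one HDFS block.
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word] += 1
    return counts

def reduce_phase(partials):
    # The reducer merges the mappers' partial results into the final answer.
    total = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            total[word] += n
    return dict(total)

# Stand-ins for HDFS blocks; in a real cluster the map calls run in parallel.
chunks = ["big data big", "data analytics big"]
result = reduce_phase(map_phase(c) for c in chunks)
print(result["big"])  # 3
```

Hadoop adds what this sketch omits: distributed storage (HDFS), scheduling of map tasks near their data, and fault tolerance when a node fails.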
5 Citations


Big Data Analytics Using Apache Hive to Analyze Health Data
  • Pavani Konagala
  • Computer Science
    Nature-Inspired Algorithms for Big Data Frameworks
  • 2019
This chapter presents a brief study of big data techniques for analyzing such data, including a wide-ranging study of Hadoop's characteristics, the Hadoop architecture, the advantages of big data, and the big data ecosystem.
Big data emerging technologies: A case study with analyzing twitter data using apache hive
A comprehensive study of major emerging Big Data technologies, highlighting their important features, how they work, and how they compare; it also presents a performance analysis of an Apache Hive query over Twitter tweets, measuring the MapReduce CPU time spent and the total time taken to finish the job.
Analyzing BigData with Hadoop cluster in HDInsight azure Cloud
Cloud-based Hadoop has recently gained a lot of interest, offering a ready-to-use Hadoop cluster environment for processing Big Data and eliminating the operational challenges of on-site hardware.
Analysis of User Behavior for Twitter Posts on Hadoop
Tweets are available in JSON format and must be converted into structured data, yielding an analysis of how users in a given country and city behave with respect to a particular topic.
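The JSON-to-structured-data step this summary describes can be sketched in a few lines of Python. This is an illustrative example only; the field names (`user.screen_name`, `text`, `place.country`, `place.name`) follow the classic Twitter v1.1 payload and should be treated as assumptions, since actual payloads vary:

```python
import json

# A minimal tweet payload (assumed field names, Twitter v1.1 style).
raw = '''{"user": {"screen_name": "alice"},
          "text": "loving big data",
          "place": {"country": "India", "name": "Bangalore"}}'''

tweet = json.loads(raw)

# Flatten the nested JSON into one structured row; "place" can be null
# in real tweets, so guard the nested lookups.
row = {
    "user":    tweet["user"]["screen_name"],
    "text":    tweet["text"],
    "country": (tweet.get("place") or {}).get("country"),
    "city":    (tweet.get("place") or {}).get("name"),
}
print(row["country"], row["city"])  # India Bangalore
```

Rows in this shape can then be grouped by country and city to study per-topic user behavior, as the paper describes.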


References

Starfish: A Self-tuning System for Big Data Analytics
Starfish is introduced, a self-tuning system for big data analytics that builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop.
Hadoop: The Definitive Guide
This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
h-MapReduce: A Framework for Workload Balancing in MapReduce
This work tackles the workload balancing issue by introducing a hierarchical MapReduce, or h-MapReduce for short, and demonstrates its performance gain over standard MapReduce for data-intensive algorithms.
Getting more for less in optimized MapReduce workflows
This work offers a novel performance evaluation framework for easing the user efforts of tuning the reduce task settings while achieving performance objectives and validate the accuracy, effectiveness, and performance benefits of the proposed framework using a set of realistic MapReduce applications and queries from the TPC-H benchmark.
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing, due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase.
Application of Hadoop MapReduce technique to Virtual Database system design
  • S. Sathya, M. Victor Jose
  • Computer Science
    2011 International Conference on Emerging Trends in Electrical and Computer Technology
  • 2011
This paper proposes to utilize the parallel and distributed processing capability of Hadoop MapReduce for handling heterogeneous query execution on large datasets; building on top of this yields effective, high-performance distributed data integration.
ARIA: automatic resource inference and allocation for mapreduce environments
This work designs a MapReduce performance model, implements a novel SLO-based scheduler in Hadoop that determines job ordering and the amount of resources to allocate for meeting job deadlines, and validates the approach using a set of realistic applications.
Nova: continuous Pig/Hadoop workflows
A workflow manager developed and deployed at Yahoo called Nova is described, which pushes continually-arriving data through graphs of Pig programs executing on Hadoop clusters, which is a good fit for a large fraction of Yahoo's data processing use-cases.
No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics
The Elastisizer is introduced, a system to which users can express cluster sizing problems as queries in a declarative fashion and provides reliable answers to these queries using an automated technique that uses a mix of job profiling, estimation using black-box and white-box models, and simulation.
Analysis of resource usage profile for MapReduce applications using Hadoop on cloud
  • Zheyuan Liu, Dejun Mu
  • Computer Science
    2012 International Conference on Quality, Reliability, Risk, Maintenance, and Safety Engineering
  • 2012
A study of resource consumption profiles for MapReduce applications using Hadoop on Amazon EC2, running the Grep, Word Count, and Sort applications while altering Hadoop's configuration parameters corresponding to the I/O buffer.