Corpus ID: 58458964

Hadoop: The Definitive Guide

@book{White2009HadoopTD,
  title={Hadoop: The Definitive Guide},
  author={Tom White},
  year={2009}
}
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open-source implementation of MapReduce, the programming model on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that…
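For readers new to the model, a minimal sketch of the canonical WordCount job illustrates the map and reduce phases the book builds on. This is the standard example for the Hadoop MapReduce Java API, not code taken from the book; input and output paths are assumed to come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```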
A case for MapReduce over the internet
TLDR: This paper investigates real-world scenarios in which the MapReduce programming model, and specifically the Hadoop framework, can be used to process large-scale, geographically scattered datasets, and proposes and evaluates extensions to Hadoop's MapReduce framework to improve its performance in such environments.
Implementations of iterative algorithms in Hadoop and Spark
TLDR: The main contribution of the thesis is to implement the PageRank algorithm and the Conjugate Gradient method in Hadoop and Spark, and to show how Spark outperforms Hadoop by taking advantage of in-memory caching.
Big Data Analytics Overview with Hadoop and Spark
In this modern era of technology, the amount of data and information has increased vastly. It is essential to acquire and use appropriate software to manage this ever-growing store of data.
Efficient Ways to Improve the Performance of HDFS for Small Files
TLDR: Hadoop, the Hadoop Distributed File System, MapReduce, the small-files problem, and ways to deal with it are introduced.
Assessment of Multiple MapReduce Strategies for Fast Analytics of Small Files
TLDR: An analysis of existing MapReduce strategies for small files is conducted, and theoretical and empirical methods are used to evaluate these strategies for processing small files.
Performance Evaluation of Hadoop Distributed File System and Local File System
TLDR: This work compares the performance of the Hadoop Distributed File System (HDFS) with that of the local file system (LFS): a Hadoop cluster is set up, and an interface is designed that reports the size of a file and the time taken to upload it to or download it from each file system.
Chukwa: A System for Reliable Large-Scale Log Collection
TLDR: This work presents Chukwa, a system that embodies a unified approach to failure handling built on MapReduce; it uses an end-to-end delivery model that can leverage local on-disk log files for reliability and eases integration with legacy systems.
Beyond Hadoop MapReduce Apache Tez and Apache Spark
Hadoop MapReduce has become the de facto standard for processing voluminous data on large clusters of machines; however, it requires any problem to be formulated as a strict three-stage map-shuffle-reduce process…
AN EVALUATION OF THE SPARK PROGRAMMING MODEL FOR BIG DATA ANALYTICS
TLDR: This thesis evaluates the performance offered by the Spark programming model for big data analytics, comparing the performance, clustering quality, and usability of the K-Means clustering implementation provided by Spark's MLlib library against that of Apache Mahout.
HDFS File Formats: Study and Performance Comparison
TLDR: This work studies new HDFS file formats to find out their characteristics, builds a theoretical framework to compare them, and makes it easy to recognize which formats fit the needs of the data.

References

MapReduce: Simplified Data Processing on Large Clusters
TLDR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large datasets, which runs on large clusters of commodity machines and is highly scalable.
Sorting 1PB with MapReduce. November 21, 2008.
TeraByte Sort on Apache Hadoop. 2008.