• Corpus ID: 58458964

Hadoop: The Definitive Guide

  • Tom White
  • Published 29 May 2009
  • Computer Science
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that… 
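The MapReduce model that Hadoop implements can be illustrated with a minimal word-count sketch. This is plain Python, not the Hadoop API; the three functions stand in for the map, shuffle, and reduce phases that the framework runs across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word, like a Hadoop Mapper."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word, like a Hadoop Reducer."""
    return (key, sum(values))

documents = ["hadoop stores data", "hadoop processes data"]
pairs = [kv for doc in documents for kv in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop, each phase runs as distributed tasks over HDFS blocks rather than in a single process; the data flow is the same.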


A case for MapReduce over the internet

This paper investigates real-world scenarios in which the MapReduce programming model, and specifically the Hadoop framework, could be used for processing large-scale, geographically scattered datasets, and proposes and evaluates extensions to Hadoop's MapReduce framework that improve its performance in such environments.

Implementations of iterative algorithms in Hadoop and Spark

The main contribution of the thesis is to implement the PageRank algorithm and the Conjugate Gradient method in Hadoop and Spark, and to show how Spark outperforms Hadoop by taking advantage of in-memory caching.

An Optimal Solution for small file problem in Hadoop

This research work proposes a more efficient technique for handling the small-file problem in Hadoop, based on file merging, hashing and caching; it saves memory at the NameNode, reduces average memory usage at DataNodes, and improves access efficiency compared to existing techniques.
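The file-merging idea behind such small-file mitigations can be sketched as follows. This is an illustrative toy, not the paper's implementation: many small files are packed into one blob, and an index of (offset, length) entries allows random access to each logical file, so the NameNode tracks one large object instead of many tiny ones.

```python
import io

def merge_small_files(files):
    """Pack many small files into one blob plus an index.
    The index maps name -> (offset, length), mimicking how merged
    files can be located inside a single large HDFS block."""
    blob, index, offset = io.BytesIO(), {}, 0
    for name, data in files.items():
        blob.write(data)
        index[name] = (offset, len(data))
        offset += len(data)
    return blob.getvalue(), index

def read_merged(blob, index, name):
    """Random access to one logical file via the index lookup."""
    offset, length = index[name]
    return blob[offset:offset + length]

files = {"a.txt": b"alpha", "b.txt": b"bravo"}
blob, index = merge_small_files(files)
print(read_merged(blob, index, "b.txt"))  # b'bravo'
```

Hadoop's own HAR (Hadoop Archive) format applies the same offset-index principle at HDFS scale.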

Big Data Analytics Overview with Hadoop and Spark

This work describes how Apache Hadoop and Spark function across various operating systems and how they are used for the analysis of large and diverse datasets.

Efficient Ways to Improve the Performance of HDFS for Small Files

This paper introduces Hadoop, the Hadoop Distributed File System, MapReduce, the small-file problem, and ways to deal with it.

FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy

This work proposes two general methods, with corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the Hadoop Distributed File System, requiring very little knowledge of Hadoop.

Assessment of Multiple MapReduce Strategies for Fast Analytics of Small Files

An analysis of existing different MapReduce strategies for small files is conducted and theoretical and empirical methods are used to evaluate these strategies for processing small files.

Performance Evaluation of Hadoop Distributed File System and Local File System

This work draws a comparison between the performance of HDFS (Hadoop Distributed File System) and the Local File System (LFS): the authors set up a Hadoop cluster and design an interface that reports a file's size and the time taken to upload it to or download it from the LFS and HDFS.

Chukwa: A System for Reliable Large-Scale Log Collection

This work presents a system, called Chukwa, that embodies MapReduce's unified approach to failure handling; it uses an end-to-end delivery model that can leverage local on-disk log files for reliability and eases integration with legacy systems.

Beyond Hadoop MapReduce: Apache Tez and Apache Spark

This paper delves into Hadoop and MapReduce architecture and its shortcomings and examines alternatives such as Apache Tez and Apache Spark for their suitability in iterative and interactive workloads.

MapReduce: simplified data processing on large clusters

This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
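The routing of intermediate data to reduce tasks can be sketched with a partition function, following the MapReduce paper's default of hash(key) mod R. The deterministic CRC32 hash and the sample pairs below are illustrative choices, not the paper's code; the point is that every occurrence of a key lands in the same reducer's bucket.

```python
import zlib

NUM_REDUCERS = 3

def partition(key, num_reducers=NUM_REDUCERS):
    """Deterministic stand-in for MapReduce's default hash(key) mod R
    partitioner: every occurrence of a key goes to the same reducer."""
    return zlib.crc32(key.encode()) % num_reducers

# Route intermediate pairs from map tasks into per-reducer buckets.
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
buckets = {r: [] for r in range(NUM_REDUCERS)}
for key, value in pairs:
    buckets[partition(key)].append((key, value))

# Both 'apple' pairs land in the same bucket, so one reduce task sees them all.
print(buckets[partition("apple")])
```

Because the partition function is pure, no coordination is needed between map tasks: the runtime can schedule them on any machine and the shuffle still delivers each key to exactly one reducer.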

Sorting 1PB with MapReduce

  • November 21, 2008

TeraByte Sort on Apache Hadoop

  • 2008