Apache Hadoop goes realtime at Facebook

  title={Apache Hadoop goes realtime at Facebook},
  author={Dhruba Borthakur and Jonathan Gray and Joydeep Sen Sarma and Kannan Muthukkaruppan and Nicolas Spiegelberg and Hairong Kuang and Karthika Ranganathan and Dmytro Molkov and Aravind Menon and Samuel Rash and Rodrigo Schmidt and Amitanand S. Aiyer},
  booktitle={ACM SIGMOD Conference},
Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Key result: We offer these observations on the deployment as a model for other companies that are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

Survey Paper on Big Data Processing and Hadoop Components

The MapReduce framework based on Hadoop and the current state of the art in MapReduce algorithms for big data analysis are introduced.

Automated Table Partitioner (ATAP) in Apache Hive

A novel means of automating table partitioning in Hive is proposed. It includes a lexical analyzer that reads HiveQL queries and, in return, issues Data Definition Language (DDL) statements to restructure a table if a particular column is read more often than the user-set coefficient factor.

DNN: A Distributed NameNode Filesystem for Hadoop

It is argued that HDFS should have a flat namespace instead of the hierarchical one used in traditional POSIX-based file systems, and a novel distributed NameNode architecture based on the flat namespace is proposed that improves both the availability and scalability of HDFS.

An Efficient Replicated System for the Metadata of HDFS

This paper presents a solution that enables high availability for the HDFS NameNode through efficient metadata replication, and builds and evaluates a prototype called NCluster to demonstrate its feasibility and effectiveness.

Camdoop: Exploiting In-network Aggregation for Big Data Applications

Camdoop, a MapReduce-like system running on CamCube (a cluster design that uses a direct-connect network topology with servers directly linked to other servers), is built and shown to significantly reduce network traffic and to provide a large performance increase over a version of Camdoop running over a switch and over two production systems, Hadoop and Dryad/DryadLINQ.

Open data challenges at Facebook

This tutorial will describe Facebook's data systems and the current challenges they face, and lead a discussion on these challenges, approaches to solve them, and potential pitfalls to stimulate interest in solving these problems in the research community.

Supporting Scalable Analytics with Latency Constraints

Results from real-world workloads show that the techniques implemented in Incremental Hadoop reduce its latency from tens of seconds to sub-second, with a 2x-5x increase in throughput, and that it outperforms the state-of-the-art distributed stream systems Storm and Spark Streaming when latency and throughput are considered together.

Achieving Dynamic Resource Allocation in the Hadoop Cloud System

A new scheme enabling dynamic resource allocation to jobs selected by job schedulers is designed and implemented, which could help some current and future Hadoop job schedulers speed up the execution of high-priority jobs.

Dual-JT: Toward the high availability of JobTracker in Hadoop

  • Jian Wan, Minggang Liu, Wei Wu
  • Computer Science
    4th IEEE International Conference on Cloud Computing Technology and Science Proceedings
  • 2012
This paper designs a solution that removes the JobTracker as a single point of failure and enhances its availability by introducing a standby JobTracker that acts as a hot backup node for the active JobTracker.

Implementation of parallel Hash Join algorithms over Hadoop

The basic idea behind this work is to modify the query evaluation techniques used by parallel database management systems in order to use the Hadoop MapReduce framework as the underlying execution engine.



ZooKeeper: Wait-free Coordination for Internet-scale Systems

ZooKeeper provides a per-client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state, enabling the implementation of a high-performance processing pipeline in which read requests are satisfied by local servers.

MapReduce: simplified data processing on large clusters

This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

The log-structured merge-tree (LSM-tree)

The log-structured merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period.

Bigtable: A Distributed Storage System for Structured Data

The simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, is described, along with the design and implementation of Bigtable.

Available at http://hadoop.apache

Available at http://hive.apache

Available at http://hbase.apache

Facebook Messages

  • 2010

Available at http://labs.google.com/papers/mapreduce-osdi04

  • MapReduce: Simplified Data Processing on Large Clusters