Apache Hadoop goes realtime at Facebook

@inproceedings{borthakur2011hadoop,
  title={Apache Hadoop goes realtime at Facebook},
  author={Dhruba Borthakur and Jonathan Gray and Joydeep Sen Sarma and Kannan Muthukkaruppan and Nicolas Spiegelberg and Hairong Kuang and Karthika Ranganathan and Dmytro Molkov and Aravind Menon and Samuel Rash and Rodrigo Schmidt and Amitanand S. Aiyer},
  booktitle={SIGMOD '11},
  year={2011}
}
Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. [...] We offer these observations on the deployment as a model for other companies that are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.
Survey Paper on Big Data Processing and Hadoop Components
The MapReduce framework based on Hadoop and the current state of the art in MapReduce algorithms for big data analysis are introduced.
Critical Insight for MapReduce Optimization in Hadoop
A comprehensive review of major works discusses the prominent issues that will need to be addressed when developing MapReduce optimizations in the future.
Automated Table Partitioner (ATAP) in Apache Hive
A novel means of automating table partitioning in Hive is proposed: a lexical analyzer reads HiveQL queries and, in return, issues Data Definition Language (DDL) statements to restructure a table if a particular column is read more often than a user-set coefficient factor.
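The idea in the entry above, counting column reads in incoming queries and emitting restructuring DDL once a threshold is crossed, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function name, the naive WHERE-clause regex, and the DDL string are not taken from the ATAP paper or from Hive's actual partitioning syntax.

```python
import re
from collections import Counter

def suggest_partitions(queries, threshold=3):
    """Hypothetical sketch of query-driven partitioning: count how often
    each column appears in WHERE-clause equality predicates and emit a
    (purely illustrative) DDL suggestion once a column crosses the
    user-set threshold."""
    reads = Counter()
    partitioned = set()
    ddl = []
    for q in queries:
        # Very naive "lexical analysis": columns compared in WHERE clauses.
        for col in re.findall(r"WHERE\s+(\w+)\s*=", q, re.IGNORECASE):
            reads[col] += 1
            if reads[col] >= threshold and col not in partitioned:
                partitioned.add(col)
                # Illustrative placeholder; real Hive partitioning requires
                # declaring partition columns when the table is (re)created.
                ddl.append(f"ALTER TABLE t PARTITION BY ({col})")
    return ddl
```

In practice a real implementation would parse HiveQL properly rather than pattern-match, but the sketch captures the read-frequency trigger the summary describes.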
DNN: A Distributed NameNode Filesystem for Hadoop
It is argued that HDFS should have a flat namespace instead of the hierarchical one used in traditional POSIX-based file systems, and a novel distributed NameNode architecture based on the flat namespace is proposed that improves both the availability and scalability of HDFS.
On the use of microservers in supporting hadoop applications
This paper conducts a quantitative study of six representative Hadoop applications on five hardware configurations, and defines a comprehensive metric, PerfEC, which unifies the performance, energy consumption, and acquisition and operating costs of the applications and helps identify appropriate clusters for Hadoop applications.
An Efficient Replicated System for the Metadata of HDFS
This paper presents a solution that enables high availability for HDFS's NameNode through efficient metadata replication, and builds and evaluates a prototype called NCluster to demonstrate its feasibility and effectiveness.
Camdoop: Exploiting In-network Aggregation for Big Data Applications
Camdoop, a MapReduce-like system running on CamCube, a cluster design that uses a direct-connect network topology with servers linked directly to other servers, is built; it significantly reduces network traffic and delivers a large performance increase over a switch-based version of Camdoop and over two production systems, Hadoop and Dryad/DryadLINQ.
Open data challenges at Facebook
This tutorial describes Facebook's data systems and the current challenges they face, and leads a discussion of these challenges, approaches to solving them, and potential pitfalls, in order to stimulate interest in these problems within the research community.
Supporting Scalable Analytics with Latency Constraints
Results from real-world workloads show that the techniques implemented in Incremental Hadoop reduce its latency from tens of seconds to sub-second with a 2x-5x increase in throughput, and that it outperforms the state-of-the-art distributed stream systems Storm and Spark Streaming when latency and throughput are considered together.
Achieving Dynamic Resource Allocation in the Hadoop Cloud System
A new scheme that enables dynamic resource allocation to jobs selected by job schedulers is designed and implemented; it could help current and future Hadoop job schedulers speed up the execution of high-priority jobs.


ZooKeeper: Wait-free Coordination for Internet-scale Systems
ZooKeeper provides a per-client guarantee of FIFO execution of requests, and linearizability for all requests that change ZooKeeper state, enabling the implementation of a high-performance processing pipeline in which read requests are satisfied by local servers.
Bigtable: A Distributed Storage System for Structured Data
This paper describes the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, as well as the design and implementation of Bigtable.
MapReduce: Simplified Data Processing on Large Clusters
This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
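The programming model named in the entry above can be illustrated with a tiny word-count sketch. This is an assumption-laden toy, not Hadoop's or Google's implementation: the user supplies map and reduce functions, and a driver performs the shuffle (grouping by key) between the two phases.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce phase: sum all partial counts for one key.
    return word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key, as the framework would
    # do across machines; here it is a single in-process dictionary.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # One reduce call per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
# result == {"a": 2, "b": 2, "c": 1}
```

The scalability claim in the summary comes from the fact that both phases are embarrassingly parallel: map calls are independent per record, and reduce calls are independent per key.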
The log-structured merge-tree (LSM-tree)
The log-structured merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period.
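The structure described above can be sketched in a few lines: writes land in a small in-memory buffer, which is periodically flushed as an immutable sorted run, and reads consult the buffer first and then the runs from newest to oldest. This is a minimal illustration of the idea only; class and parameter names are invented, and a real LSM-tree also merges (compacts) runs and keeps them on disk.

```python
import bisect

class TinyLSM:
    """Minimal LSM-tree sketch: an in-memory buffer plus sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}              # recent writes, cheap to update
        self.runs = []                  # older data as immutable sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sort the buffer and append it as a new run (newest run last).
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search runs newest-first; binary search inside each sorted run.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

The write path never updates data in place, which is what makes inserts cheap; the cost is paid on reads, which may probe several runs, and in the background merging a full implementation would perform.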
Facebook Messages. 2010.
Fsck. Available at http://en.wikipedia.org/wiki/Fsck
Memcached. Available at http://en.wikipedia.org/wiki/Memcached
Scribe. Available at http://github.com/facebook/scribe
Hadoop. Available at http://hadoop.apache
HDFS. Available at http://hadoop.apache.org/hdfs
HBase. Available at http://hbase.apache