Apache Hadoop goes realtime at Facebook
@inproceedings{Borthakur2011ApacheHG,
  title={Apache Hadoop goes realtime at Facebook},
  author={Dhruba Borthakur and Jonathan Gray and Joydeep Sen Sarma and Kannan Muthukkaruppan and Nicolas Spiegelberg and Hairong Kuang and Karthika Ranganathan and Dmytro Molkov and Aravind Menon and Samuel Rash and Rodrigo Schmidt and Amitanand S. Aiyer},
  booktitle={ACM SIGMOD Conference},
  year={2011}
}
Facebook recently deployed Facebook Messages, its first-ever user-facing application built on the Apache Hadoop platform. [...] Key Result: We offer these observations on the deployment as a model for other companies that are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.
121 Citations
Survey Paper on Big Data Processing and Hadoop Components
- Computer Science
- 2014
The MapReduce framework based on Hadoop and the current state of the art in MapReduce algorithms for big data analysis are introduced.
Automated Table Partitioner (ATAP) in Apache Hive
- Computer Science
- 2018 4th International Conference on Computer and Information Sciences (ICCOINS)
- 2018
A novel means of automating table partitioning in Hive is proposed; it includes a lexical analyzer that reads HiveQL queries and, in return, issues Data Definition Language (DDL) statements to restructure a table if a particular column is read more often than a user-set coefficient factor.
DNN: A Distributed NameNode Filesystem for Hadoop
- Computer Science
- 2014
It is argued that HDFS should have a flat namespace instead of the hierarchical one used in traditional POSIX-based file systems, and a novel distributed NameNode architecture based on the flat namespace is proposed that improves both the availability and scalability of HDFS.
An Efficient Replicated System for the Metadata of HDFS
- Computer Science
- 2016
This paper presents a solution that enables high availability for the HDFS NameNode through efficient metadata replication, and it builds and evaluates a prototype called NCluster to demonstrate its feasibility and effectiveness.
Camdoop: Exploiting In-network Aggregation for Big Data Applications
- Computer Science
- NSDI
- 2012
Camdoop, a MapReduce-like system running on CamCube (a cluster design that uses a direct-connect network topology with servers directly linked to other servers), is built and shown to significantly reduce network traffic and to provide a large performance increase over a version of Camdoop running over a switch and over two production systems, Hadoop and Dryad/DryadLINQ.
Open data challenges at Facebook
- Computer Science
- 2015 IEEE 31st International Conference on Data Engineering
- 2015
This tutorial will describe Facebook's data systems and the current challenges they face, and lead a discussion on these challenges, approaches to solve them, and potential pitfalls to stimulate interest in solving these problems in the research community.
Supporting Scalable Analytics with Latency Constraints
- Computer Science
- Proc. VLDB Endow.
- 2015
Results from real-world workloads show that the techniques implemented in Incremental Hadoop reduce its latency from tens of seconds to sub-second, with a 2x-5x increase in throughput, and that it outperforms state-of-the-art distributed stream systems, Storm and Spark Streaming, when combining latency and throughput.
Achieving Dynamic Resource Allocation in the Hadoop Cloud System
- Computer Science
- IOV
- 2019
A new scheme enabling dynamic resource allocation to jobs selected by job schedulers is designed and implemented, which could help some current and future Hadoop job schedulers speed up the execution of high-priority jobs.
Dual-JT: Toward the high availability of JobTracker in Hadoop
- Computer Science
- 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings
- 2012
This paper designs a solution that resolves the JobTracker's single point of failure and enhances its availability by introducing a standby JobTracker that acts as a hot backup node for the active JobTracker.
Implementation of parallel Hash Join algorithms over Hadoop
- Computer Science
- 2011
The basic idea behind this work is to modify the query evaluation techniques used by parallel database management systems in order to use the Hadoop MapReduce framework as the underlying execution engine.
References
ZooKeeper: Wait-free Coordination for Internet-scale Systems
- Computer Science
- USENIX Annual Technical Conference
- 2010
ZooKeeper provides a per-client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state, enabling the implementation of a high-performance processing pipeline with read requests being satisfied by local servers.
MapReduce: simplified data processing on large clusters
- Computer Science
- CACM
- 2008
This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
The log-structured merge-tree (LSM-tree)
- Computer Science
- Acta Informatica
- 2009
The log-structured merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period.
Bigtable: A Distributed Storage System for Structured Data
- Computer Science
- TOCS
- 2008
The simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, is described, along with the design and implementation of Bigtable.
Available at http://hadoop.apache
Available at http://hive.apache
Available at http://hbase.apache
Facebook Messages
- 2010
MapReduce: Simplified Data Processing on Large Clusters
- Available at http://labs.google.com/papers/mapreduce-osdi04