Corpus ID: 11805561

Building LinkedIn's Real-time Activity Data Pipeline

@article{goodhope2012building,
  title={Building LinkedIn's Real-time Activity Data Pipeline},
  author={Ken Goodhope and Joel Koshy and Jay Kreps and Neha Narkhede and Richard Park and Jun Rao and Victor Yang Ye},
  journal={IEEE Data Eng. Bull.},
  year={2012}
}
One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines…
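As a rough illustration of the log/event messages the abstract describes, the sketch below builds a structured activity event and serializes it one-per-line, as a log-style pipeline typically would. The field names and event types here are assumptions for illustration, not LinkedIn's actual schema.

```python
import json
import time

def make_activity_event(event_type, user_id, payload):
    """Build a structured activity event. Field names are illustrative,
    not the schema used by any real pipeline."""
    return {
        "type": event_type,          # e.g. "page_view", "ad_impression"
        "user": user_id,
        "timestamp_ms": int(time.time() * 1000),
        "payload": payload,
    }

# Events are commonly serialized one JSON object per line for log-style
# collection and downstream consumption.
event = make_activity_event("page_view", 42, {"page": "/jobs"})
line = json.dumps(event)
decoded = json.loads(line)
```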
The big data ecosystem at LinkedIn
LinkedIn's Hadoop-based analytics stack is presented, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data, together with solutions to the "last mile" issues of providing a rich developer ecosystem.
Fast data in the era of big data: Twitter's real-time related query suggestion architecture
A case study illustrating the challenges of real-time data processing in the era of "big data": the story of how the system was built twice points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.
Developing a Real-Time Data Analytics Framework Using Hadoop
This paper proposes an architecture based on the Storm/YARN projects for the ingestion, processing, exploration, and visualization of streaming structured and unstructured data, and implements it using Apache Storm APIs in both local and distributed modes.
Survey of real-time processing systems for big data
A survey of the open source technologies that support big data processing in a real-time/near-real-time fashion, including their system architectures and platforms, is presented.
Strider: A Hybrid Adaptive Distributed RDF Stream Processing Engine
Strider is a hybrid adaptive distributed RDF Stream Processing engine that optimizes the logical query plan according to the state of the data streams, and is designed to guarantee important industrial properties such as scalability, high availability, fault tolerance, high throughput, and acceptable latency.
Toward Scalable Systems for Big Data Analytics: A Technology Tutorial
This paper presents a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics, and presents the prevalent Hadoop framework for addressing big data challenges.
A Publish/Subscribe System built on a DHT Substrate (Master of Science Thesis in the Program Computer Systems and Networks)
The publish/subscribe pattern is commonly found in messaging systems and message-oriented middleware, used when large numbers of processes publish messages in applications where low latency and high…
A benchmark suite for distributed stream processing systems
A framework was created with an API to generalize application development and collect metrics, with the possibility of extending it to support other platforms in the future; the usefulness of the benchmark suite was demonstrated by comparing these systems.
Scaling big data mining infrastructure: the twitter experience
This paper discusses the evolution of Twitter's infrastructure and the development of capabilities for data mining on "big data", and observes that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated together into production workflows.
Adapting CakeDB to Integrate High-Pressure Big Data Streams with Low-Pressure Systems
Through a use case of a financial trading company and a high-performance compute cluster, the paper demonstrates how different applications require different pressures, and why it must be possible to scale down high-pressure streams for low-pressure applications without impacting the applications that require the full high-pressure feed.


Kafka: a Distributed Messaging System for Log Processing
This work introduces Kafka, a distributed messaging system that was developed for collecting and delivering high volumes of log data with low latency, and shows that Kafka has superior performance when compared to two popular messaging systems.
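The core abstraction behind the system described here is an append-only log per topic that consumers read by pulling from an explicit offset. The toy in-memory sketch below mirrors that model only; real Kafka is distributed, persistent, partitioned, and replicated.

```python
from collections import defaultdict

class MiniLog:
    """Toy sketch of a Kafka-style broker: one append-only log per
    topic, consumed by pulling batches from an explicit offset."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1  # offset assigned to the appended message

    def consume(self, topic, offset, max_messages=10):
        log = self._topics[topic]
        batch = log[offset:offset + max_messages]
        return batch, offset + len(batch)  # batch plus next offset to pull

broker = MiniLog()
for i in range(3):
    broker.publish("activity", f"event-{i}")

# The consumer, not the broker, tracks its position in the log.
batch, next_off = broker.consume("activity", 0, max_messages=2)
```

Keeping the consumer's position as a plain offset is what lets such a log support many independent consumers, each replaying the stream at its own pace.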
Chukwa: A System for Reliable Large-Scale Log Collection
This work presents a system, called Chukwa, that embodies the unified approach to failure handling of MapReduce, and uses an end-to-end delivery model that can leverage local on-disk log files for reliability and eases integration with legacy systems.
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different…
ZooKeeper: Wait-free Coordination for Internet-scale Systems
ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state to enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers.
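The ordering guarantee summarized above can be sketched in miniature: every state-changing request is applied in a single total order and stamped with a monotonically increasing transaction id (ZooKeeper calls this a zxid), while reads are served without blocking from local state. This is a single-process toy, not a distributed implementation.

```python
import threading

class MiniCoordinator:
    """Toy sketch of ZooKeeper-style ordering: writes are totally
    ordered and stamped with an increasing transaction id; reads are
    served without blocking (and in a real replicated system may lag
    the latest write)."""

    def __init__(self):
        self._data = {}
        self._zxid = 0
        self._write_lock = threading.Lock()

    def set_node(self, path, value):
        with self._write_lock:  # serializes all state changes
            self._zxid += 1
            self._data[path] = (value, self._zxid)
            return self._zxid

    def get_node(self, path):
        # Non-blocking read of local state.
        return self._data.get(path)

zk = MiniCoordinator()
zk.set_node("/config/leader", "broker-1")
second = zk.set_node("/config/leader", "broker-2")
value, stamp = zk.get_node("/config/leader")
```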
Peta-scale data warehousing at Yahoo!
Everest, a SQL-compliant data warehousing engine based on a column architecture, is introduced; it has been in production at Yahoo! since 2007 and currently manages over six petabytes of data.
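The column architecture mentioned here stores each column contiguously, so a scan or aggregate touches only the columns it needs. A minimal sketch of that layout (illustrative only, with hypothetical column names):

```python
class ColumnTable:
    """Toy sketch of column-oriented storage: each column is kept as a
    separate contiguous array rather than storing whole rows together."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}

    def insert(self, row):
        # A row insert appends one value to each column array.
        for name, value in row.items():
            self.columns[name].append(value)

    def sum_column(self, name):
        # An aggregate reads one contiguous column; the others are
        # never touched, which is the key I/O saving of column stores.
        return sum(self.columns[name])

t = ColumnTable(["user_id", "clicks"])
t.insert({"user_id": 1, "clicks": 3})
t.insert({"user_id": 2, "clicks": 5})
total = t.sum_column("clicks")
```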
Building a high-level dataflow system on top of Map-Reduce
The scalable Map-Reduce dataflow paradigm has become dominant for capturing, transforming, and analyzing enormous data sets.
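High-level dataflow systems like Pig compile scripts down to the two-phase map/shuffle/reduce model this entry refers to. The canonical word-count example, sketched here in plain Python:

```python
from itertools import groupby

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then reduce: sum the counts per word."""
    counts = {}
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        counts[key] = sum(v for _, v in group)
    return counts

result = reduce_phase(map_phase(["big data big", "data pipeline"]))
```

The sort-then-group step stands in for the shuffle that a real Map-Reduce framework performs across the cluster between the two phases.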
Latency Lags Bandwidth
This paper lists a half-dozen performance milestones to document this observation: bandwidth improves by more than the square of the improvement in latency for four different technologies: disks, networks, memories and processors.
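The rule of thumb quoted above is easy to state as arithmetic: if latency improves by a factor L over some period, bandwidth improves by more than L squared. The numbers below are hypothetical, chosen only to illustrate the check, not taken from the paper's milestones.

```python
def bandwidth_outpaces_latency_squared(latency_speedup, bandwidth_speedup):
    """Check the rule of thumb: bandwidth improvement exceeds the
    square of the latency improvement."""
    return bandwidth_speedup > latency_speedup ** 2

# Hypothetical example: an 8x latency improvement would, by the rule,
# be accompanied by more than a 64x bandwidth improvement.
ok = bandwidth_outpaces_latency_squared(8, 100)
```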
The Linux Programming Interface, 2010
…, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. Building a high-level dataflow system on top of MapReduce: the Pig experience. Proc. VLDB Endow.
Apache Software Foundation