Building LinkedIn's Real-time Activity Data Pipeline

@article{Goodhope2012BuildingLR,
  title={Building LinkedIn's Real-time Activity Data Pipeline},
  author={Ken Goodhope and Joel Koshy and Jay Kreps and Neha Narkhede and Richard Park and Jun Rao and Victor Yang Ye},
  journal={IEEE Data Eng. Bull.},
  year={2012},
  volume={35},
  pages={33-45}
}
One trend in the implementation of modern web systems is the use of activity data in the form of log or event messages that capture user and server activity. This data is at the heart of many internet systems in the domains of advertising, relevance, search, recommendation systems, and security, as well as continuing to fulfill its traditional role in analytics and reporting. Many of these uses place real-time demands on data feeds. Activity data is extremely high volume and real-time pipelines… 
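
Since the pipeline the paper describes is built on Kafka, a minimal producer sketch may help make the model concrete. This is not LinkedIn's code: it is written against today's Kafka Java producer API rather than the 2012-era API, and the broker address, topic name ("user-activity"), key, and payload are illustrative assumptions.

```java
// Minimal sketch (not LinkedIn's code): publishing one activity event to a
// Kafka topic with the modern Java producer API. Broker address, topic name,
// key, and JSON payload are illustrative assumptions.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by member id keeps one user's events in one partition,
            // so they are delivered in order.
            ProducerRecord<String, String> event = new ProducerRecord<>(
                "user-activity", "member-42",
                "{\"action\":\"page_view\",\"page\":\"/jobs\"}");
            producer.send(event); // asynchronous; the client batches for throughput
        } // close() flushes any buffered events
    }
}
```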

The big data ecosystem at LinkedIn

This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data, along with solutions to the "last mile" issues of providing a rich developer ecosystem.

Fast data in the era of big data: Twitter's real-time related query suggestion architecture

This case study illustrates the challenges of real-time data processing in the era of "big data" and recounts how the system was built twice, pointing the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

Developing a Real-Time Data Analytics Framework Using Hadoop

This paper proposes an architecture based on the Storm/YARN projects for the ingestion, processing, exploration, and visualization of streaming structured and unstructured data, and implements the proposed architecture using Apache Storm APIs in both a local mode and a distributed mode.
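
As a hedged sketch of the local-versus-distributed submission pattern described above (not the paper's code; it targets the Apache Storm 2.x core API, and the synthetic spout, trivial bolt, and parallelism settings are assumptions):

```java
// Sketch of one topology run in either local or distributed mode.
// EventSpout emits synthetic events (an assumption; the paper ingests
// real streams). Written against the Apache Storm 2.x core API.
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StreamingTopology {
    // Synthetic source standing in for a real ingestion spout.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map<String, Object> conf, TopologyContext ctx,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(100); // throttle the synthetic stream
            collector.emit(new Values("page_view"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("event"));
        }
    }

    // Trivial processing stage; a real bolt would aggregate or enrich.
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println(input.getStringByField("event"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {}
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new EventSpout(), 1);
        builder.setBolt("print", new PrintBolt(), 2).shuffleGrouping("events");

        Config conf = new Config();
        if (args.length > 0) {
            // Distributed mode: submit to a running Storm cluster.
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // Local mode: run in-process for development and testing.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("local-demo", conf, builder.createTopology());
            Utils.sleep(10_000);
            cluster.shutdown();
        }
    }
}
```

Running it with no arguments exercises local mode; passing a topology name submits the same topology to a cluster via StormSubmitter.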

Strider: A Hybrid Adaptive Distributed RDF Stream Processing Engine

This paper proposes Strider, a hybrid adaptive distributed RDF Stream Processing engine that optimizes its logical query plan according to the state of the data streams, and is designed to guarantee important industrial properties such as scalability, high availability, fault tolerance, high throughput, and acceptable latency.

Toward Scalable Systems for Big Data Analytics: A Technology Tutorial

This paper presents a systematic framework that decomposes big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics, and reviews the prevalent Hadoop framework for addressing big data challenges.

Publish/Subscribe System Built on a DHT Substrate (Master of Science Thesis in the Program Computer Systems and Networks)

This thesis examines how the popular messaging systems RabbitMQ and Kafka handle publish/subscribe workloads when topic-based message filtering is used to model subscriptions, and introduces a prototype messaging system, Ingeborg, built on the key-value store Riak, which allows even greater flexibility in prioritizing the trade-offs and performance properties of a system.

A benchmark suite for distributed stream processing systems

A framework with an API was created to generalize application development and metric collection, with the possibility of extending it to support other platforms in the future, and the usefulness of the benchmark suite was demonstrated by comparing several distributed stream processing systems.

Scaling big data mining infrastructure: the twitter experience

This paper discusses the evolution of Twitter's infrastructure and the development of capabilities for data mining on "big data", observing that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows.

Adapting CakeDB to Integrate High-Pressure Big Data Streams with Low-Pressure Systems

Through a use case involving a financial trading company and a high-performance compute cluster, the paper demonstrates how different applications require different pressures, and why high-pressure streams must be scalable down for low-pressure applications without impacting the applications that require the full high-pressure feed.

Modelling Data Pipelines

This study gives an overview of how to design a conceptual model of a data pipeline that can be used to automate monitoring, fault detection, mitigation, and alarming at different steps of the pipeline.
...

References

Showing 1-10 of 17 references

Kafka : a Distributed Messaging System for Log Processing

This work introduces Kafka, a distributed messaging system that was developed for collecting and delivering high volumes of log data with low latency, and shows that Kafka has superior performance when compared to two popular messaging systems.
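
The consuming side of Kafka's model, again as a hedged sketch against the current Kafka Java client rather than the API described in the paper; the topic and consumer-group names are assumptions.

```java
// Hedged sketch of the consuming side: a consumer group reading the same
// topic. Topic and group names are assumptions; uses the current Java client.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ActivityEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-loader"); // consumers in a group share partitions
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-activity"));
            while (true) {
                // Pull a batch of log messages; offsets track this group's progress.
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("%s -> %s%n", r.key(), r.value());
                }
            }
        }
    }
}
```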

Chukwa: A System for Reliable Large-Scale Log Collection

This work presents Chukwa, a system that applies the unified failure-handling approach of MapReduce and uses an end-to-end delivery model that can leverage local on-disk log files for reliability and eases integration with legacy systems.

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

This paper introduces the design of Dapper, Google's production distributed systems tracing infrastructure, and describes how its design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale were met.

ZooKeeper: Wait-free Coordination for Internet-scale Systems

ZooKeeper provides a per-client guarantee of FIFO execution of requests, and linearizability for all requests that change ZooKeeper state, enabling the implementation of a high-performance processing pipeline in which read requests are satisfied by local servers.
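
A minimal sketch of the read/write split described above, using the standard ZooKeeper Java client; the connect string, znode path, and payload are assumptions for illustration.

```java
// Minimal sketch: one linearizable write followed by a locally served read.
// Connect string, znode path, and payload are illustrative assumptions.
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkReadWriteExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // State-changing requests are linearized across the ensemble.
        zk.create("/pipeline-leader",
                  "broker-1".getBytes(StandardCharsets.UTF_8),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL); // vanishes if this session dies

        // Reads are answered by the server this client is connected to,
        // which is what makes ZooKeeper's read path fast.
        byte[] data = zk.getData("/pipeline-leader", false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```

Because only state-changing requests go through the ensemble's agreement protocol, reads like the getData call above stay cheap and local.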

Peta-scale data warehousing at Yahoo!

This paper introduces Everest, a SQL-compliant data warehousing engine based on a column architecture, which has been in production at Yahoo! since 2007 and currently manages over six petabytes of data.

Latency Lags Bandwidth

This paper lists a half-dozen performance milestones to document the observation that bandwidth improves by more than the square of the improvement in latency for four different technologies: disks, networks, memories, and processors.
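
Stated as a rule of thumb, the title's claim is a simple inequality; the factor of 10 below is invented for the arithmetic and is not the paper's data.

```latex
% The title claim as a rule of thumb: a technology's bandwidth gain
% exceeds the square of its latency gain. The factor 10 is illustrative,
% not data from the paper.
\[
  \frac{B_{\text{new}}}{B_{\text{old}}}
  \;\gtrsim\;
  \left(\frac{L_{\text{old}}}{L_{\text{new}}}\right)^{2},
  \qquad
  \text{e.g. a } 10\times \text{ drop in latency comes with a }
  \gtrsim 100\times \text{ rise in bandwidth.}
\]
```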

Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proc. VLDB Endow., 2009.
Apache Software Foundation. Flume. https://cwiki.apache.org/FLUME

The Linux Programming Interface. 2010.