Twitter Heron: Stream Processing at Scale

@inproceedings{Kulkarni2015TwitterHS,
  title={Twitter Heron: Stream Processing at Scale},
  author={Sanjeev Kulkarni and Nikunj Bhagat and Maosong Fu and Vikas Kedigehalli and Christopher Kellogg and Sailesh Mittal and Jignesh M. Patel and Karthikeyan Ramasamy and Siddarth Taneja},
  booktitle={Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data},
  year={2015}
}
Storm has long served as the main platform for real-time analytics at Twitter. [...] In this paper, we also provide empirical evidence demonstrating the efficiency and scalability of Heron.

Twitter Heron: Towards Extensible Streaming Engines

The challenges faced when transforming Heron from a system tailored for Twitter's applications and software stack to a system that efficiently handles applications with diverse characteristics on top of various Big Data platforms are discussed.

Low Latency Stream Processing: Twitter Heron with Infiniband and Omni-Path

The authors present their findings on integrating the Twitter Heron distributed stream processing system with two high-performance interconnects: Infiniband and Intel Omni-Path.

Realtime Data Processing at Facebook

This paper identifies five important design decisions that affect the ease of use, performance, fault tolerance, scalability, and correctness of the realtime stream processing systems Puma, Swift, and Stylus, and illustrates how these decisions and systems satisfy the requirements of multiple use cases at Facebook.

Automatic Scaling of Resources in a Storm Topology

The paper proposes ARiSTO, a system that automatically decides on the appropriate amount of resources to provision for each node of a Storm workflow topology, based on user-defined performance and cost constraints, and elastically auto-scales the allocated resources to maintain the desired performance even under changes in load.

Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing

For getting up-to-date insight into online services, extracted data has to be processed in near real time. For example, major big data companies (Facebook, LinkedIn, Twitter) analyse streaming data

Squall: Stream Processing and Analysis Model Design

The Squall framework can be used as a general-purpose big data processing framework because it overcomes the drawbacks of existing systems such as Apache Storm and Spark Streaming by drawing on the advantages of the Go language.

Dhalion: Self-Regulating Stream Processing in Heron

The notion of self-regulating streaming systems and the key properties that they must satisfy are introduced, and the design and evaluation of Dhalion, a system that provides self-regulation capabilities to underlying streaming systems, are presented.

Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter

The paper presents TSAR (TimeSeries AggregatoR), a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitoring: aggregating interactions with Tweets, segmented along a multitude of dimensions such as device and engagement type.

Scaling Event Aggregation at Twitter to Handle Billions of Events per minute

This paper provides an overview of the Event Aggregation framework used at Twitter, highlights its advantages, compares it with similar frameworks, and introduces the concepts of category groups and aggregator groups in the architecture.

Neon: Low-Latency Streaming Pipelines for HPC

  • Pierre Matri, R. Ross
  • Computer Science
    2021 IEEE 14th International Conference on Cloud Computing (CLOUD)
  • 2021
Neon is a clean-slate design of a streaming data processing framework for HPC systems that enables users to create arbitrarily large streaming pipelines; experimental results on the Bebop supercomputer show significant performance improvements.

References

Storm@twitter

The paper describes the architecture of Storm and its methods for distributed scale-out and fault tolerance, explains how queries are executed in Storm, and presents some operational stories based on running Storm at Twitter.

Stormy: an elastic and highly available streaming service in the cloud

Stormy is a distributed stream processing service for continuous data processing based on proven techniques from existing Cloud storage systems that are adapted to efficiently execute streaming workloads, while at the same time optimizing resource utilization and increasing cost efficiency.

Summingbird: A Framework for Integrating Batch and Online MapReduce Computations

The key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. As a consequence, Summingbird imposes constraints on the types of aggregations that can be performed, although in practice the authors have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

Apache Hadoop YARN: yet another resource negotiator

The paper summarizes the design, development, and current state of deployment of YARN, the next generation of Hadoop's compute platform, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.

Querying and mining data streams: you only get one look, a tutorial

In these situations, algorithms are needed that can summarize the data stream in a concise but reasonably accurate synopsis, one that fits in the allotted (small) amount of memory and can be used to provide approximate answers to user queries along with some reasonable guarantees on the quality of the approximation.

S4: Distributed Stream Computing Platform

The S4 architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers.

Photon: fault-tolerant and scalable joining of continuous data streams

The paper describes the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real time with high scalability and low latency, where the streams may be unordered or delayed.

Kafka: a Distributed Messaging System for Log Processing

This work introduces Kafka, a distributed messaging system that was developed for collecting and delivering high volumes of log data with low latency, and shows that Kafka has superior performance when compared to two popular messaging systems.

MillWheel: Fault-Tolerant Stream Processing at Internet Scale

In practice, this paper finds that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.

The extensibility framework in Microsoft StreamInsight

The paper describes the extensibility framework in StreamInsight, an ongoing effort at Microsoft SQL Server to support the integration of user-defined modules in a stream processing system in a manner that is easy to use, powerful, and practical.