Twitter Heron: Stream Processing at Scale
@article{Kulkarni2015TwitterHS,
  title={Twitter Heron: Stream Processing at Scale},
  author={Sanjeev Kulkarni and Nikunj Bhagat and Maosong Fu and Vikas Kedigehalli and Christopher Kellogg and Sailesh Mittal and Jignesh M. Patel and Karthikeyan Ramasamy and Siddarth Taneja},
  journal={Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data},
  year={2015}
}
Storm has long served as the main platform for real-time analytics at Twitter. […]
Key Result: In this paper, we also provide empirical evidence demonstrating the efficiency and scalability of Heron.
582 Citations
Twitter Heron: Towards Extensible Streaming Engines
- Computer Science
- 2017 IEEE 33rd International Conference on Data Engineering (ICDE)
- 2017
This paper discusses the challenges faced when transforming Heron from a system tailored for Twitter's applications and software stack into a system that efficiently handles applications with diverse characteristics on top of various Big Data platforms.
Low Latency Stream Processing: Twitter Heron with Infiniband and Omni-Path
- Computer Science
- 2017
The authors present their findings on integrating the Twitter Heron distributed stream processing system with two high-performance interconnects: Infiniband and Intel Omni-Path.
Realtime Data Processing at Facebook
- Computer Science
- SIGMOD Conference
- 2016
This paper identifies five important design decisions that affect ease of use, performance, fault tolerance, scalability, and correctness in the realtime stream processing systems Puma, Swift, and Stylus, and illustrates how these decisions and systems satisfy the requirements for multiple use cases at Facebook.
Automatic Scaling of Resources in a Storm Topology
- Computer Science
- ALGOCLOUD
- 2017
This paper proposes ARiSTO, a system that automatically decides on the appropriate amount of resources to provision for each node of a Storm workflow topology based on user-defined performance and cost constraints, and elastically auto-scales the allocated resources to maintain the desired performance even under changes in load.
Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing
- Computer Science
- Journal of Big Data
- 2016
For getting up-to-date insight into online services, extracted data has to be processed in near real time. For example, major big data companies (Facebook, LinkedIn, Twitter) analyse streaming data…
Squall: Stream Processing and Analysis Model Design
- Computer Science
- RACS
- 2017
The Squall framework can be used as a general-purpose big data processing framework because it can overcome the drawbacks of existing systems such as Apache Storm and Spark Streaming by drawing on the advantages of the Go language.
Dhalion: Self-Regulating Stream Processing in Heron
- Computer Science
- Proc. VLDB Endow.
- 2017
The notion of self-regulating streaming systems and the key properties that they must satisfy are introduced, and the design and evaluation of Dhalion, a system that provides self-regulation capabilities to underlying streaming systems, are presented.
Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter
- Computer Science
- SIGMOD Conference
- 2018
TSAR (TimeSeries AggregatoR) is a robust, scalable, real-time event time series aggregation framework built primarily for engagement monitoring: aggregating interactions with Tweets, segmented along a multitude of dimensions such as device and engagement type.
Scaling Event Aggregation at Twitter to Handle Billions of Events per minute
- Computer Science
- 2020 IEEE Infrastructure Conference
- 2020
This paper provides an overview of the Event Aggregation framework used at Twitter, highlights its advantages, compares it with similar frameworks, and introduces the concepts of category group and aggregator group in the architecture.
Neon: Low-Latency Streaming Pipelines for HPC
- Computer Science
- 2021 IEEE 14th International Conference on Cloud Computing (CLOUD)
- 2021
Neon is a clean-slate design of a streaming data processing framework for HPC systems that enables users to create arbitrarily large streaming pipelines; experimental results on the Bebop supercomputer show significant performance improvements.
References
Storm@twitter
- Computer Science
- SIGMOD Conference
- 2014
This paper describes the architecture of Storm and its methods for distributed scale-out and fault tolerance, explains how queries are executed in Storm, and presents some operational stories based on running Storm at Twitter.
Stormy: an elastic and highly available streaming service in the cloud
- Computer Science
- EDBT-ICDT '12
- 2012
Stormy is a distributed stream processing service for continuous data processing based on proven techniques from existing Cloud storage systems that are adapted to efficiently execute streaming workloads, while at the same time optimizing resource utilization and increasing cost efficiency.
Summingbird: A Framework for Integrating Batch and Online MapReduce Computations
- Computer Science
- Proc. VLDB Endow.
- 2014
The key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice the authors have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.
Apache Hadoop YARN: yet another resource negotiator
- Computer Science
- SoCC
- 2013
This paper summarizes the design, development, and current state of deployment of YARN, the next generation of Hadoop's compute platform, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
Querying and mining data streams: you only get one look a tutorial
- Computer Science
- SIGMOD '02
- 2002
These situations call for algorithms that can summarize the data stream in a concise but reasonably accurate synopsis, one that can be stored in the allotted (small) amount of memory and used to provide approximate answers to user queries along with some reasonable guarantees on the quality of the approximation.
S4: Distributed Stream Computing Platform
- Computer Science
- 2010 IEEE International Conference on Data Mining Workshops
- 2010
The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers.
Photon: fault-tolerant and scalable joining of continuous data streams
- Computer Science
- SIGMOD '13
- 2013
This paper describes the architecture of Photon, a geographically distributed system for joining multiple continuously flowing streams of data in real time with high scalability and low latency, where the streams may be unordered or delayed.
Kafka: a Distributed Messaging System for Log Processing
- Computer Science
- 2011
This work introduces Kafka, a distributed messaging system that was developed for collecting and delivering high volumes of log data with low latency, and shows that Kafka has superior performance when compared to two popular messaging systems.
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- Computer Science
- Proc. VLDB Endow.
- 2013
In practice, this paper finds that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
The extensibility framework in Microsoft StreamInsight
- Computer Science
- 2011 IEEE 27th International Conference on Data Engineering
- 2011
This paper describes the extensibility framework in StreamInsight, an ongoing effort at Microsoft SQL Server to support the integration of user-defined modules in a stream processing system in a manner that is easy to use, powerful, and practical.