State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing

@article{Carbone2017StateMI,
  title={State Management in Apache Flink{\textregistered}: Consistent Stateful Distributed Stream Processing},
  author={Paris Carbone and Stephan Ewen and Gyula F{\'o}ra and Seif Haridi and Stefan Richter and Kostas Tzoumas},
  journal={Proc. VLDB Endow.},
  year={2017},
  volume={10},
  pages={1718--1729}
}
Stream processors are emerging in industry as an apparatus that drives analytical but also mission critical services handling the core of persistent application logic. [...] We present Flink's core pipelined, in-flight mechanism which guarantees the creation of lightweight, consistent, distributed snapshots of application state, progressively, without impacting continuous execution.

Operational Stream Processing: Towards Scalable and Consistent Event-Driven Applications
TLDR
It is strongly believed that streaming dataflows can have a central place in service-oriented architectures, taking over the execution of ACID transactions and ensuring message delivery and processing, in order to perform scalable execution of services.
SR3: Customizable Recovery for Stateful Stream Processing Systems
TLDR
SR3 is presented, a customizable state recovery framework that provides fast and scalable state recovery mechanisms for protecting large distributed states in stream processing systems and adopts a decentralized architecture that partitions and replicates states by using consistent ring overlays that leverage distributed hash tables (DHTs).
Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka
TLDR
This work presents Apache Kafka's core design for stream processing, which relies on its persistent log architecture as the storage and inter-processor communication layers to achieve correctness guarantees, and demonstrates how Kafka Streams behaves in practice with large-scale deployments and performance insights exhibiting its flexible and low-overhead trade-offs.
Amoeba: aligning stream processing operators with externally-managed state
TLDR
Amoeba is described, a system that dynamically adapts data-partitioning schemes and/or task or data placement across systems to eliminate unnecessary network communication across nodes and demonstrates 2.6x performance improvement when aligning SPS tasks with KVS shards in AWS deployments of up to 64 nodes.
Epoch alignment in stateful streams
TLDR
A mechanism to align the progress of multiple independent jobs sharing common event sources is proposed, and it is shown that this so-called epoch alignment can be achieved with minimal additional costs over exactly-once processing semantics.
Rhino: Efficient Management of Very Large Distributed State for Stream Processing Engines
TLDR
Rhino provides a handover protocol and a state migration protocol to consistently and efficiently migrate stream processing among servers and reconfigures a running query 15 times faster than the state-of-the-art, and reduces latency by three orders of magnitude upon a reconfiguration.
FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications
TLDR
FP4S is a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications and is applied on Apache Storm and evaluated using large-scale real-world experiments, which demonstrate its scalability, efficiency, and fast failure recovery features.
TSpoon: Transactions on a stream processor
A Cloud Native Platform for Stateful Streaming
TLDR
This work presents the architecture of a cloud native version of IBM Streams, with Kubernetes as the target platform, and eliminates 75% of the original platform code.
Lineage stash: fault tolerance off the critical path
TLDR
The lineage stash is proposed, a decentralized causal logging technique that significantly reduces the runtime overhead of lineage-based approaches without impacting recovery efficiency and makes it possible to support large-scale, low-latency data processing applications with low runtime and recovery overheads.

References

SHOWING 1-10 OF 33 REFERENCES
Integrating scale out and fault tolerance in stream processing using operator state management
TLDR
The key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives; the resulting system can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
ZooKeeper: Wait-free Coordination for Internet-scale Systems
TLDR
ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state to enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers.
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
TLDR
In practice, this paper finds that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
Naiad: a timely dataflow system
TLDR
It is shown that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining.
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
TLDR
D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster.
Meteor Shower: A Reliable Stream Processing System for Commodity Data Centers
TLDR
The proposed Meteor Shower is a new fault-tolerant DSPS that overcomes large-scale burst failures while improving overall performance, and consists of three new techniques: 1) source preservation, 2) parallel, asynchronous checkpointing, and 3) application-aware checkpointing.
Consistent Regions: Guaranteed Tuple Processing in IBM Streams
TLDR
This paper describes how IBM Streams, an enterprise-grade stream processing system, was enabled to provide data processing guarantees, and the solution goes from language-level abstractions to a runtime protocol.
Apache Hadoop YARN: yet another resource negotiator
TLDR
This paper summarizes the design, development, and current state of deployment of YARN, the next generation of Hadoop's compute platform, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
TLDR
One such approach is presented, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.
Scalable Distributed Stream Processing
TLDR
The architectural challenges facing the design of large-scale distributed stream processing systems are described, and novel approaches for addressing load management, high availability, and federated operation issues are discussed.