Corpus ID: 2992790

Lightweight Asynchronous Snapshots for Distributed Dataflows

@article{Carbone2015LightweightAS,
  title={Lightweight Asynchronous Snapshots for Distributed Dataflows},
  author={Paris Carbone and Gyula F{\'o}ra and Stephan Ewen and Seif Haridi and Kostas Tzoumas},
  journal={ArXiv},
  year={2015},
  volume={abs/1506.08603}
}
Distributed stateful stream processing enables the deployment and execution of large scale continuous computations in the cloud, targeting both low latency and high throughput. [...] Key Method: We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on the execution, maintaining linear scalability and performing well with frequent snapshots.
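Since the abstract only states that ABS snapshots state without halting the dataflow, a minimal sketch of the barrier-alignment idea may help. It assumes a single-threaded operator with multiple input channels; the class and method names (AbsOperator, onRecord, onBarrier) are hypothetical and do not correspond to Apache Flink's actual runtime API.

```java
// A minimal, single-threaded sketch of asynchronous barrier snapshotting (ABS)
// at one operator with several input channels. All names here are illustrative
// and are NOT Apache Flink's runtime API.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class AbsOperator {
    private final int numInputs;
    private final Map<String, Long> state = new HashMap<>();             // operator state: counts per key
    private final Set<Integer> barrierReceived = new HashSet<>();         // channels already past the barrier
    private final Map<Integer, Queue<String>> aligned = new HashMap<>();  // records held back during alignment

    public AbsOperator(int numInputs) {
        this.numInputs = numInputs;
        for (int i = 0; i < numInputs; i++) {
            aligned.put(i, new ArrayDeque<>());
        }
    }

    /** A data record arrives on input channel `channel`. */
    public void onRecord(int channel, String record) {
        if (barrierReceived.contains(channel)) {
            // This channel already delivered the barrier: buffer its records so
            // post-barrier data does not leak into the pre-barrier snapshot.
            aligned.get(channel).add(record);
        } else {
            process(record);
        }
    }

    /** A checkpoint barrier arrives on input channel `channel`. */
    public void onBarrier(int channel, long checkpointId) {
        barrierReceived.add(channel);
        if (barrierReceived.size() == numInputs) {
            // Barriers from all inputs are aligned: take the snapshot, forward
            // the barrier downstream, then release the buffered records.
            snapshotState(checkpointId);
            System.out.println("forwarding barrier " + checkpointId + " downstream");
            barrierReceived.clear();
            for (Queue<String> q : aligned.values()) {
                while (!q.isEmpty()) {
                    process(q.poll());
                }
            }
        }
    }

    private void process(String record) {
        state.merge(record, 1L, Long::sum);
    }

    private void snapshotState(long checkpointId) {
        // A real implementation writes this copy asynchronously to durable
        // storage; the sketch just prints it.
        System.out.println("checkpoint " + checkpointId + ": " + new HashMap<>(state));
    }

    public static void main(String[] args) {
        AbsOperator op = new AbsOperator(2);
        op.onRecord(0, "a");
        op.onBarrier(0, 1);   // channel 0 reaches the barrier first
        op.onRecord(0, "b");  // buffered until channel 1 aligns
        op.onRecord(1, "a");  // still pre-barrier on channel 1, processed now
        op.onBarrier(1, 1);   // alignment complete: snapshot {a=2}, then replay "b"
    }
}
```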
Consistent High-Availability for Distributed Streaming Computations
Distributed dataflow computations have emerged in response to the need for processing large datasets quickly. Stream processing, a type of distributed dataflow, has enabled real-time analytics, with
Fast and Precise recovery in Stream processing based on Distributed Cache
TLDR
This work saves the intermediate results produced during stream processing, proposes an algorithm, DCAS, that asynchronously snapshots state to implement precise recovery, and uses an in-memory distributed cache to store intermediate results and snapshots, reducing recovery latency.
Efficient Migration of Very Large Distributed State for Scalable Stream Processing
TLDR
An incremental migration mechanism for fine-grained state shards, based on periodic incremental checkpoints and replica groups, is proposed that enables moving large state with minimal impact on stream processing; a low-latency hand-over protocol is also presented that smoothly migrates tuple processing among work units.
System-aware dynamic partitioning for batch and streaming workloads
TLDR
A lightweight on-the-fly Dynamic Repartitioning (DR) module for distributed data processing systems (DDPS), including Apache Spark and Flink, is presented that improves performance with negligible overhead.
Large-Scale Data Stream Processing Systems
TLDR
This chapter introduces the major design aspects of large scale data stream processing systems, covering programming model abstraction levels and runtime concerns, and presents a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor used for a wide variety of processing tasks.
StreamScope: Continuous Reliable Distributed Processing of Big Data Streams
TLDR
rVertex and rStream are introduced, two abstractions that allow efficient and flexible distributed execution and failure recovery, make it easy to reason about correctness even with failures, and facilitate the development, debugging, and deployment of complex multi-stage streaming applications.
Stream processing platforms for analyzing big dynamic data
TLDR
Piglet is introduced, an extended Pig Latin language and code generator that compiles (extended) Pig Latin code into programs for various data processing platforms and discusses the mapping to platform-specific concepts in order to provide a uniform view.
Lineage stash: fault tolerance off the critical path
TLDR
The lineage stash is proposed, a decentralized causal logging technique that significantly reduces the runtime overhead of lineage-based approaches without impacting recovery efficiency and makes it possible to support large-scale, low-latency data processing applications with low runtime and recovery overheads.
AutoFlow: Hotspot-Aware, Dynamic Load Balancing for Distributed Stream Processing
TLDR
This work introduces AutoFlow, an automatic, hotspot-aware dynamic load balancing system for streaming dataflows that incorporates a centralized scheduler which dynamically monitors the load balance of the entire dataflow and performs state migrations accordingly.
Asynchronous snapshots of actor systems for latency-sensitive applications
TLDR
This is the first system that enables asynchronous snapshotting of actor applications, i.e. without stop-the-world synchronization, thereby minimizing the impact on latency and enabling new deployment and debugging options for actor systems.

References

Showing 1-10 of 16 references
Naiad: a timely dataflow system
TLDR
It is shown that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining.
TimeStream: reliable stream computation in the cloud
TLDR
This work advocates a powerful new abstraction called resilient substitution that caters to the specific needs in this new computation model to handle failure recovery and dynamic reconfiguration in response to load changes.
Making State Explicit for Imperative Big Data Processing
TLDR
The idea is to infer the dataflow and the types of state accesses from a Java program and use this information to generate a stateful dataflow graph (SDG), and it is shown that the performance of SDGs for several imperative online applications matches that of existing data-parallel processing frameworks.
Integrating scale out and fault tolerance in stream processing using operator state management
TLDR
The key idea is to expose internal operator state explicitly to the stream processing system (SPS) through a set of state management primitives; the resulting system can scale automatically to a load factor of L=350 with 50 VMs while recovering quickly from failures.
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
TLDR
D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster.
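As a rough illustration of the recovery idea summarized above, the following sketch cuts the stream into deterministic micro-batches, retains their inputs, and recomputes any lost result independently of the others, which is what makes parallel recovery possible. MicroBatchJob and its methods are hypothetical names, not Spark Streaming's API.

```java
// Illustrative sketch of micro-batched processing with parallel recovery of
// lost results, in the spirit of discretized streams. Names are hypothetical.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MicroBatchJob {
    private final Map<Integer, List<String>> inputs = new ConcurrentHashMap<>(); // retained batch inputs
    private final Map<Integer, Long> results = new ConcurrentHashMap<>();        // per-batch results

    /** Process one micro-batch deterministically: here, count records containing "error". */
    public void runBatch(int batchId, List<String> records) {
        inputs.put(batchId, records);
        results.put(batchId, compute(records));
    }

    /** Drop a result, simulating the loss of a worker's partition. */
    public void loseResult(int batchId) {
        results.remove(batchId);
    }

    /** Recompute every missing batch result from its retained input, in parallel. */
    public void recover() {
        inputs.keySet().parallelStream()
              .filter(id -> !results.containsKey(id))
              .forEach(id -> results.put(id, compute(inputs.get(id))));
    }

    private long compute(List<String> records) {
        // Deterministic function of the input only, so recomputation after a
        // failure yields the same result as the original run.
        return records.stream().filter(r -> r.contains("error")).count();
    }

    public static void main(String[] args) {
        MicroBatchJob job = new MicroBatchJob();
        job.runBatch(1, List.of("ok", "error: disk"));
        job.runBatch(2, List.of("error: net", "error: cpu"));
        job.loseResult(2);              // simulate a failure
        job.recover();                  // batch 2 is recomputed from its retained input
        System.out.println(job.results); // prints both results, e.g. {1=1, 2=2}
    }
}
```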
An introduction to snapshot algorithms in distributed computing
TLDR
This paper first discusses the issues that have to be addressed to compute distributed snapshots in a consistent way, then presents several algorithms that determine such snapshots on-the-fly for several types of networks (according to the properties of their communication channels).
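The marker rule at the core of the classic algorithms surveyed in this reference can be condensed into a short sketch for a single process; the messaging layer, the snapshot initiator, and all names below are illustrative.

```java
// Condensed sketch of the marker rule from the Chandy-Lamport snapshot
// algorithm, for a single process with FIFO channels.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotProcess {
    private int localState = 0;                                               // application state (a running sum here)
    private Integer recordedState = null;                                     // null until this process records its state
    private final Map<Integer, List<Integer>> channelState = new HashMap<>(); // in-flight messages per incoming channel
    private final Map<Integer, Boolean> markerSeen = new HashMap<>();

    /** An application message arrives on incoming channel `ch`. */
    public void onMessage(int ch, int msg) {
        localState += msg;   // normal processing continues during the snapshot
        if (recordedState != null && !markerSeen.getOrDefault(ch, false)) {
            // Messages arriving between recording our state and seeing the
            // marker on `ch` belong to that channel's recorded state.
            channelState.computeIfAbsent(ch, k -> new ArrayList<>()).add(msg);
        }
    }

    /** A marker arrives on incoming channel `ch`. */
    public void onMarker(int ch) {
        if (recordedState == null) {
            recordedState = localState;                                      // record own state on the first marker
            System.out.println("sending marker on all outgoing channels");
            channelState.put(ch, new ArrayList<>());                         // the first-marker channel is recorded as empty
        }
        markerSeen.put(ch, true);                                            // stop recording in-flight messages on `ch`
    }

    public static void main(String[] args) {
        SnapshotProcess p = new SnapshotProcess();
        p.onMessage(0, 5);   // pre-snapshot message: only updates local state
        p.onMarker(0);       // first marker: record state (5), channel 0 recorded as empty
        p.onMessage(1, 3);   // in-flight on channel 1: captured in channel 1's recorded state
        p.onMarker(1);       // marker on channel 1: stop recording that channel
        System.out.println("state=" + p.recordedState + " channels=" + p.channelState);
    }
}
```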
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms.
Comet: batched stream processing for data intensive distributed computing
TLDR
A query processing system called Comet is developed that embraces batched stream processing and integrates with DryadLINQ; when applied to a real production trace covering over 19 million machine-hours, it shows an estimated I/O saving of over 50%.
Piccolo: Building Fast, Distributed Programs with Partitioned Tables
TLDR
Experiments show Piccolo to be faster than existing data flow models for many problems, while providing similar fault-tolerance guarantees and a convenient programming interface.
The Stratosphere platform for big data analytics
TLDR
The overall system architecture design decisions are presented, Stratosphere is introduced through example queries, and the internal workings of the system’s components relating to extensibility, programming model, optimization, and query execution are examined.