Phoebe: A Learning-based Checkpoint Optimizer

@article{Zhu2021PhoebeAL,
  title={Phoebe: A Learning-based Checkpoint Optimizer},
  author={Yiwen Zhu and Matteo Interlandi and Abhishek Roy and Krishnadhan Das and Hiren Patel and Malay Bag and Hitesh Kumar Sharma and Alekh Jindal},
  journal={Proc. VLDB Endow.},
  year={2021},
  volume={14},
  pages={2505-2518}
}
Abstract

Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems; hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates are examples of issues that we are facing at Microsoft…
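The abstract above is truncated, but it frames checkpointing as a trade-off between temporary-storage pressure and recovery time after failures. The sketch below is purely illustrative and is not Phoebe's algorithm (Phoebe is learning-based and constraint-driven); every name and number in it is a hypothetical stand-in for the shape of the decision a checkpoint optimizer makes: which operators' outputs to persist so that recovery stays cheap without exceeding a temp-storage budget.

```python
# Illustrative sketch only -- NOT Phoebe's method. All names
# (Operator, pick_checkpoints, storage_budget_gb) are hypothetical.
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    output_size_gb: float        # temp storage consumed if checkpointed
    time_to_recompute_s: float   # work lost if we must rerun from upstream


def pick_checkpoints(plan: list[Operator], storage_budget_gb: float) -> list[str]:
    """Greedy heuristic: checkpoint the operators that save the most
    recomputation time per GB of temporary storage, within a budget."""
    ranked = sorted(
        plan,
        key=lambda op: op.time_to_recompute_s / op.output_size_gb,
        reverse=True,
    )
    chosen, used = [], 0.0
    for op in ranked:
        if used + op.output_size_gb <= storage_budget_gb:
            chosen.append(op.name)
            used += op.output_size_gb
    return chosen


if __name__ == "__main__":
    plan = [
        Operator("scan", output_size_gb=500.0, time_to_recompute_s=300.0),
        Operator("join", output_size_gb=40.0, time_to_recompute_s=1800.0),
        Operator("aggregate", output_size_gb=2.0, time_to_recompute_s=900.0),
    ]
    # Prefers the cheap-to-store, expensive-to-recompute operators:
    # ['aggregate', 'join']
    print(pick_checkpoints(plan, storage_budget_gb=50.0))
```

A learning-based optimizer in the spirit of Phoebe would replace the hand-set recomputation times and sizes above with predictions learned from prior job executions, rather than relying on a fixed greedy rule.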
Citations

Deploying a Steered Query Optimizer in Production at Microsoft
TLDR
The resulting system, QO-Advisor, essentially externalizes the query planner to a massive offline pipeline for better exploration and specialization; the paper shows detailed results over production SCOPE workloads at Microsoft, where the system is currently enabled by default.
Fine-Grained Modeling and Optimization for Intelligent Resource Management in Big Data Processing
TLDR
This paper proposes a new architecture that breaks resource optimization (RO) into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level RO decisions in well under a second.
Machine Learning for Cloud Data Systems: the Promise, the Progress, and the Path Forward
TLDR
The goal of this tutorial is to educate the audience about the state of the art in ML for cloud data systems, both in research and in practice, and to compare and contrast the promise of ML for systems with the ground actually covered in industry.
