Corpus ID: 220265829

Lachesis: Automated Generation of Persistent Partitionings for Big Data Applications

@article{Zou2020LachesisAG,
  title={Lachesis: Automated Generation of Persistent Partitionings for Big Data Applications},
  author={Jia Zou and Pratik Barhate and Amitabha Das and Arun Iyengar and Binhang Yuan and Dimitrije Jankov and Chris Jermaine},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.16529}
}
Persistent partitioning improves performance by avoiding the expensive shuffling operation while incurring relatively small overhead. However, it remains a significant challenge to automate this process for Big Data analytics workloads that make extensive use of user-defined functions. This is because user-defined functions written in an object-oriented language such as Python, Scala, or Java can contain arbitrary code that is opaque to the system and makes it hard to extract and… 
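To illustrate the claim that persistent partitioning avoids shuffling, here is a minimal Python sketch. It is not the Lachesis implementation; the helper names (`hash_partition`, `local_join`, `co_partitioned_join`), the partition count, and the sample data are all hypothetical. It assumes two datasets are hash-partitioned on the join key once, when they are stored, so that a later join can run partition by partition without moving records between partitions.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of storage partitions


def hash_partition(records, key_fn, num_partitions=NUM_PARTITIONS):
    """Assign each record to a partition by hashing its key (done once, at storage time)."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[hash(key_fn(record)) % num_partitions].append(record)
    return partitions


def local_join(left_part, right_part, left_key, right_key):
    """Hash-join two partitions that already hold matching keys."""
    index = defaultdict(list)
    for left in left_part:
        index[left_key(left)].append(left)
    return [(left, right)
            for right in right_part
            for left in index.get(right_key(right), [])]


def co_partitioned_join(left_parts, right_parts, left_key, right_key):
    """Join partition i with partition i only; no record crosses partitions."""
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_join(left_part, right_part, left_key, right_key))
    return joined


# Hypothetical sample data: (customer_id, name) and (customer_id, item).
customers = [(1, "Ada"), (2, "Bo"), (3, "Cy")]
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]


def by_id(record):
    return record[0]


# Both datasets use the same partitioner on the same key, so equal keys
# always land in partitions with the same index.
customer_parts = hash_partition(customers, by_id)
order_parts = hash_partition(orders, by_id)

result = co_partitioned_join(customer_parts, order_parts, by_id, by_id)
print(sorted(result))
```

Because both inputs share one partitioner and one key, the partition-wise join returns the same pairs as a full join. The hard part the abstract points to, working out which key an opaque user-defined function will join on, is outside the scope of this sketch.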

Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning

This work proposes to use deep learning to automatically deal with schema changes through a super cell representation and automatic injection of perturbations to the training data to make the model robust to schema changes.

Serving Deep Learning Models with Deduplication from Relational Databases

This work proposed synergistic storage optimization techniques for duplication detection, page packing, and caching to enhance database systems for model serving, and outperformed existing deep learning frameworks in the targeted scenarios.

A Survey on Deep Reinforcement Learning for Data Processing and Analytics

This work provides a comprehensive review of recent works focusing on utilizing DRL to improve data processing and analytics, and presents an introduction to key concepts, theories, and methods in DRL.

Benchmark of DNN Model Search at Deployment Time

The experimental evaluation showed that the proposed asymmetric similarity-based measurement, adaptivity, outperformed symmetric similarity-based measurements and non-similarity-based measurements in most of the workloads.

References

Weld: A Common Runtime for High Performance Data Analytics

Weld is proposed, a runtime for data-intensive applications that optimizes across disjoint libraries and functions and uses a common intermediate representation to capture the structure of diverse data-parallel workloads, including SQL, machine learning, and graph analytics.

Advanced partitioning techniques for massively distributed computation

An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide

SystemML: Declarative Machine Learning on Spark

This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics.

Automated partitioning design in parallel database systems

This paper presents a partitioning advisor that recommends the best partitioning design for an expected workload and its techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time.

Morpheus: Towards Automated SLOs for Enterprise Clusters

Morpheus is a new system that codifies implicit user expectations as explicit Service Level Objectives (SLOs) inferred from historical data, enforces SLOs using novel scheduling techniques that isolate jobs from sharing-induced performance variability, and mitigates inherent performance variance by means of dynamic reprovisioning of jobs.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

Learning a Partitioning Advisor with Deep Reinforcement Learning

A learned partitioning advisor for analytical OLAP-style workloads based on Deep Reinforcement Learning (DRL) is introduced and it is shown that this advisor is not only able to find partitionings that outperform existing approaches for automated partitioning design but that it also can easily adjust to different deployments.

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

CoHadoop is introduced, a lightweight extension of Hadoop that allows applications to control where data are stored. It not only performs better than repartition-based algorithms but also outperforms map-only algorithms that exploit data partitioning but not colocation.

Learning a Partitioning Advisor for Cloud Databases

A new learned partitioning advisor based on Deep Reinforcement Learning (DRL) for OLAP-style workloads is introduced that is able to find non-trivial partitionings for a wide range of workloads and outperforms more classical approaches for automated partitioning design.

Pangea: Monolithic Distributed Storage for Data Analytics

A single system called Pangea is proposed that can manage all data (both intermediate and long-lived data, along with their buffering/caching, data placement optimization, and failure recovery) in one monolithic storage system, without any layering.