Efficient Construction of Approximate Ad-Hoc ML Models Through Materialization and Reuse

@article{Hasani2018EfficientCO,
  title={Efficient Construction of Approximate Ad-Hoc {ML} Models Through Materialization and Reuse},
  author={Sona Hasani and Saravanan Thirumuruganathan and Abolfazl Asudeh and Nick Koudas and Gautam Das},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={1468--1481}
}
Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies, and the data used for building an ML model are often aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For…
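The reuse idea in the abstract can be sketched for the k-means case (one of the model classes the paper targets): materialize a weighted summary (coreset) per base OLAP cell, then answer a query spanning several cells by merging the materialized summaries and clustering the merged set, without rereading the raw tuples. The sketch below is illustrative only — the data, the names (`jan`, `feb`), and the farthest-first seeding are assumptions, not the paper's implementation.

```python
import numpy as np

def farthest_first_init(points, k):
    """Deterministic greedy seeding: take point 0, then repeatedly
    the point farthest from every center chosen so far."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(points[d.argmax()])
    return np.array(centers, dtype=float)

def weighted_kmeans(points, weights, k, iters=25):
    """Plain Lloyd iterations on a weighted point set."""
    centers = farthest_first_init(points, k)
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            m = labels == j
            if m.any():
                centers[j] = np.average(points[m], axis=0, weights=weights[m])
    return centers

# Hypothetical materialized summaries for two adjacent OLAP cells,
# e.g. January and February partitions: (points, weights) pairs.
rng = np.random.default_rng(1)
jan = (rng.normal(0.0, 1.0, (50, 2)), np.ones(50))
feb = (rng.normal(5.0, 1.0, (40, 2)), np.ones(40))

# Reuse: answer a query over Jan+Feb by merging the materialized
# summaries instead of re-reading and re-clustering the raw tuples.
merged_pts = np.vstack([jan[0], feb[0]])
merged_w = np.concatenate([jan[1], feb[1]])
centers = weighted_kmeans(merged_pts, merged_w, k=2)
```

Because the merged input is only as large as the two summaries, the cost of answering the range query is independent of the size of the underlying fact table.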


ApproxML: Efficient Approximate Ad-Hoc ML Models Through Materialization and Reuse
TLDR: This paper proposes to demonstrate ApproxML, a system that efficiently constructs approximate ML models for new queries from previously constructed ML models using the concepts of model materialization and reuse.
Optimizing Machine Learning Inference Queries with Correlative Proxy Models
TLDR: This paper proposes CORE, a query optimizer that better exploits predicate correlations to accelerate ML inference queries, improving query throughput by up to 63% compared to PP and up to 80% compared to running the queries as-is.
Finding Materialized Models for Model Reuse
TLDR: MMQ is presented, a privacy-protected, general, efficient, and effective materialized-model query framework that uses a Gaussian-mixture-based metric called separation degree to rank materialized models.
Materialization and Reuse Optimizations for Production Data Science Pipelines
TLDR: This work proposes a materialization algorithm that, given a storage budget, materializes the subset of artifacts that minimizes the runtime of subsequent executions, and designs a reuse algorithm that generates an execution plan by combining the pipelines into a directed acyclic graph (DAG).
An Efficient Source Model Selection Framework in Model Databases
TLDR: SMS is proposed, an effective, efficient, and flexible source model selection framework: it is effective even when the source and target datasets have significantly different data labels, flexible enough to support source models with any type of structure, and efficient in that it avoids any training process.
SMS: An Efficient Source Model Selection Framework for Model Reuse
TLDR: SMS is an effective, efficient, and flexible source model selection framework for model reuse: it is effective even when source and target datasets have significantly different data labels, flexible enough to support source models with any type of structure, and efficient in that it avoids any training process.
Workload-aware Materialization for Efficient Variable Elimination on Bayesian Networks
TLDR: This paper proposes a novel materialization method that can yield significant efficiency gains when processing inference queries with the Variable Elimination algorithm, provides an optimal polynomial-time algorithm, and discusses alternative methods.
Shahin: Faster Algorithms for Generating Explanations for Multiple Predictions
TLDR: This work proposes a principled and lightweight approach for identifying redundant computations, along with several effective heuristics for dramatically speeding up explanation generation, demonstrated over a diverse set of algorithms including LIME, Anchor, and SHAP.
Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
TLDR: This paper characterizes the particular suitability of MOP for DL on data systems; for bringing MOP-based DL to DB-resident data, it shows that there is no single "best" approach and that an interesting tradeoff space of approaches exists.
Scalable algorithms for signal reconstruction by leveraging similarity joins
TLDR: This paper proposes a dual formulation of the SRP problem, develops the Direct algorithm that is significantly more efficient than the state of the art, and describes a number of practical techniques that let the algorithm scale to settings on the order of a million by a billion.

References

Showing 1–10 of 59 references
Learning Generalized Linear Models Over Normalized Data
TLDR: A new approach named factorized learning is introduced that pushes ML computations through joins and avoids redundancy in both I/O and computation; it is often substantially faster than the alternatives but not always the fastest, necessitating a cost-based approach.
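A minimal illustration of the idea behind factorized learning: when a linear model is scored over a fact table joined with a dimension table, each dimension tuple's contribution repeats once per matching fact tuple, so the dimension-side computation can be done once per key and scaled by a count instead of once per joined row. The schema, values, and weights below are toy assumptions.

```python
import numpy as np

# Toy schema: fact table R(key, xR) joins dimension table S with
# one feature per key; S_x is indexed by join key.
R_key = np.array([0, 0, 1, 1, 1, 2])
R_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
S_x = np.array([10.0, 20.0, 30.0])
wR, wS = 0.5, 0.25  # one model weight per feature

# Materialized join: score every joined tuple, then sum.
naive = sum(wR * R_x[i] + wS * S_x[R_key[i]] for i in range(len(R_key)))

# Factorized: compute each S contribution once, scaled by how often
# that dimension tuple repeats in the join.
counts = np.bincount(R_key)
factorized = wR * R_x.sum() + (counts * wS * S_x).sum()
```

Both expressions produce the same total score, but the factorized form touches each `S` tuple once regardless of join fan-out, which is where the I/O and compute savings come from.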
To Join or Not to Join?: Thinking Twice about Joins before Feature Selection
TLDR: This work identifies the core technical issue that can cause accuracy to decrease when joins are avoided, and analyzes it theoretically to design easy-to-understand decision rules that predict when it is safe to avoid joins, leading to significant reductions in the runtime of some popular feature selection methods.
Scalable k-Means Clustering via Lightweight Coresets
TLDR: This work provides a single algorithm to construct lightweight coresets for k-means clustering as well as soft and hard Bregman clustering, and shows that the proposed algorithm outperforms existing data summarization strategies in practice.
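The lightweight-coreset construction is simple enough to sketch: sample points with a probability that mixes uniform mass with each point's squared distance to the dataset mean, and give each sample an inverse-probability weight. This follows my understanding of the lightweight-coreset construction for k-means; the dataset and sizes are illustrative.

```python
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Sample an m-point weighted coreset of X for k-means.

    q(x) mixes 1/(2n) uniform mass with half the normalized squared
    distance to the mean; each sample gets weight 1 / (m * q(x)) so
    weighted sums are unbiased estimates of full-data sums."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    mu = X.mean(axis=0)
    dist2 = ((X - mu) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * dist2 / dist2.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])

# Usage sketch: summarize 1000 points with a 200-point coreset.
X = np.random.default_rng(7).normal(size=(1000, 3))
C, w = lightweight_coreset(X, 200)
```

The uniform term keeps every point's sampling probability bounded away from zero, which caps the largest possible weight at 2n/m and keeps the estimator's variance under control.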
MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
TLDR: A system called MISTIQUE is proposed that works with traditional ML pipelines as well as deep neural networks to efficiently capture, store, and query model intermediates for diagnosis, along with a range of optimizations that reduce storage footprint, including quantization, summarization, and data de-duplication.
Mining Multi-Dimensional Constrained Gradients in Data Cubes
TLDR: An efficient algorithm is developed that pushes constraints deep into the computation process, finds all gradient-probe cell pairs in one pass, and explores bi-directional pruning between probe cells and gradient cells using transformed measures and dimensions.
ModelDB: a system for machine learning model management
TLDR: This paper describes ongoing work on ModelDB, a novel end-to-end system for the management of machine learning models; it introduces a common layer of abstractions to represent models and pipelines, and its frontend allows visual exploration and analysis of models via a web-based interface.
Materialization Optimizations for Feature Selection Workloads
TLDR: It is argued that managing the feature selection process is a pressing data management challenge, and it is shown that a simple cost-based optimizer can automatically select a near-optimal execution plan for feature selection.
A Cost-based Optimizer for Gradient Descent Optimization
TLDR: A cost-based GD optimizer is proposed that selects the best gradient descent plan for a given ML task and enables optimizations that achieve orders-of-magnitude performance speed-ups.
Scalable Training of Mixture Models via Coresets
TLDR: It is proved that a weighted set of O(dk³/ε²) data points suffices for computing a (1 + ε)-approximation of the optimal model on the original n data points, guaranteeing that models fit on the coreset also provide a good fit for the original data set.
Join synopses for approximate query answering
TLDR: This paper proposes join synopses as an effective solution to this problem and shows how precomputing just one join synopsis per relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign-key joins.