Materialization Optimizations for Feature Selection Workloads

@article{Zhang2016MaterializationOF,
  title={Materialization Optimizations for Feature Selection Workloads},
  author={Ce Zhang and Arun Kumar and Christopher R{\'e}},
  journal={ACM Trans. Database Syst.},
  year={2016},
  volume={41},
  pages={2:1-2:32}
}
There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current… 
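The feature selection workloads the paper targets are iterative: an analyst repeatedly fits models on heavily overlapping feature sets, which is exactly where materialization and reuse pay off. As a rough illustration only, the sketch below shows a greedy forward-selection loop; the plain least-squares model, NumPy arrays, and function names are assumptions for the example, not the paper's feature selection language or system.

import numpy as np

def validation_error(X_tr, y_tr, X_va, y_va, cols):
    # Fit ordinary least squares on the chosen columns, return validation MSE.
    w, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    resid = X_va[:, cols] @ w - y_va
    return float(resid @ resid) / len(y_va)

def forward_selection(X_tr, y_tr, X_va, y_va, max_features=3):
    selected, remaining = [], list(range(X_tr.shape[1]))
    while remaining and len(selected) < max_features:
        # Each candidate fit shares almost all of its input columns with the
        # previous fits -- the redundancy a materialization optimizer exploits.
        errs = {f: validation_error(X_tr, y_tr, X_va, y_va, selected + [f])
                for f in remaining}
        best = min(errs, key=errs.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = 2.0 * X[:, 1] - X[:, 5] + rng.normal(scale=0.1, size=400)
print(forward_selection(X[:300], y[:300], X[300:], y[300:]))

Every candidate evaluation in the inner loop re-reads the same columns and solves an almost identical problem, which is the kind of redundancy the paper's materialization optimizations are designed to remove.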


Processing Analytical Workloads Incrementally
TLDR
This paper introduces model materialization and incremental model reuse as first-class citizens in the execution of analytical workloads, details how to incrementally maintain models, and outlines the optimizations required to reuse materialized models and their incremental adjustments when building new ones.
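One plausible reading of incremental model reuse is warm-starting: when the analyst extends a feature set, the next fit starts from coefficients materialized for the previous one rather than from scratch. The sketch below illustrates that pattern with plain gradient descent on squared loss; the update rule, column choices, and function names are illustrative assumptions, not the algorithms from this paper.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # toy design matrix
y = X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

def fit_linear(X, y, w0=None, lr=0.05, steps=2000):
    # Gradient descent for least squares, optionally warm-started from w0.
    n, d = X.shape
    w = np.zeros(d) if w0 is None else w0.copy()
    for _ in range(steps):
        w -= lr * (X.T @ (X @ w - y)) / n
    return w

old_cols = [0, 1, 2]
w_old = fit_linear(X[:, old_cols], y)               # materialized model

# Incremental reuse: when column 3 is added, warm-start from the stored
# coefficients (padded with a zero for the new feature) instead of refitting
# from scratch; the warm start typically needs far fewer iterations.
w_new = fit_linear(X[:, old_cols + [3]], y, w0=np.append(w_old, 0.0))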
To Join or Not to Join?: Thinking Twice about Joins before Feature Selection
TLDR
This work identifies the core technical issue that could cause accuracy to decrease in some cases and analyzes this issue theoretically to design easy-to-understand decision rules to predict when it is safe to avoid joins, which led to significant reductions in the runtime of some popular feature selection methods.
Learning Generalized Linear Models Over Normalized Data
TLDR
A new approach named factorized learning is introduced that pushes ML computations through joins and avoids redundancy in both I/O and computations and is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach.
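The factorized-learning idea summarized above can be illustrated on a single least-squares gradient over a key-foreign-key join: the attribute table's features are touched once per joining key rather than once per fact-table row, so the joined table is never materialized. The schema below (fact table R with label y and foreign key fk into attribute table S) and all names are assumptions for the sketch, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(1)
n_r, n_s = 1000, 20                      # fact-table rows, attribute-table rows
X_r = rng.normal(size=(n_r, 3))          # features stored in R
X_s = rng.normal(size=(n_s, 4))          # features stored in S
fk = rng.integers(0, n_s, size=n_r)      # foreign key R -> S
y = rng.normal(size=n_r)
w_r, w_s = rng.normal(size=3), rng.normal(size=4)

# Naive: materialize the join, replicating each S row once per matching R row.
X_join = np.hstack([X_r, X_s[fk]])
grad_naive = X_join.T @ (X_join @ np.concatenate([w_r, w_s]) - y) / n_r

# Factorized: compute S-side partial predictions once per S row, gather them
# through the foreign key, and aggregate residuals per key for the S gradient.
resid = X_r @ w_r + (X_s @ w_s)[fk] - y
grad_r = X_r.T @ resid / n_r
resid_per_key = np.bincount(fk, weights=resid, minlength=n_s)
grad_s = X_s.T @ resid_per_key / n_r

assert np.allclose(grad_naive, np.concatenate([grad_r, grad_s]))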
Automatic Database Management System Tuning Through Large-scale Machine Learning
TLDR
An automated approach that leverages past experience and collects new information to tune DBMS configurations and recommends configurations that are as good as or better than ones generated by existing tools or a human expert is presented.
Demonstration of Santoku: Optimizing Machine Learning over Normalized Data
TLDR
Santoku applies the idea of factorized learning and automatically decides whether to denormalize or push ML computations through joins, and it exploits database dependencies to provide automatic insights that could help analysts with exploratory feature selection.
Enforcing Constraints for Machine Learning Systems via Declarative Feature Selection: An Experimental Study
TLDR
This work proposes Declarative Feature Selection (DFS) to simplify the design and validation of ML systems satisfying diverse user-specified constraints and shows that a meta-learning-driven optimizer can accurately predict the right strategy for an ML task at hand.
A Relational Framework for Classifier Engineering
TLDR
A formal framework for classification in the context of a relational database is proposed to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database’s modeling and understanding of data and queries and by deploying the well-studied principles of database management.
Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models
TLDR
Experimental results on real and synthetic datasets show that the proposed framework achieves improved convergence over HOGWILD! and is the only solution scalable to massive models.
BlockJoin: Efficient Matrix Partitioning Through Joins
TLDR
BlockJoin is presented, a distributed join algorithm which directly produces block-partitioned results and applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.

References

Showing 1-10 of 49 references
Towards a unified architecture for in-RDBMS analytics
TLDR
This work proposes a unified architecture for in-database analytics that requires changes to only a few dozen lines of code to integrate a new statistical technique, and demonstrates the feasibility of this architecture by integrating several popular analytics techniques into two commercial and one open-source RDBMS.
The MADlib Analytics Library or MAD Skills, the SQL
TLDR
The MADlib project is introduced, including the background that led to its beginnings and the motivation for its open-source nature; an overview of the library's architecture and design patterns is provided, along with a description of various statistical methods in that context.
Learning Generalized Linear Models Over Normalized Data
TLDR
A new approach named factorized learning is introduced that pushes ML computations through joins and avoids redundancy in both I/O and computations and is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach.
Materialization Optimizations for Feature Selection Workloads
TLDR
Feature selection, the process of selecting a feature set that will be used to build a statistical model, is the next frontier in data management.
MAD Skills: New Analysis Practices for Big Data
TLDR
This paper highlights the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence, and describes database design methodologies that support the agile working style of analysts in these settings.
Oracle Data Mining
TLDR
The functionality and algorithms behind ODM are described, along with two examples of its use: an SVM methodology for tumor classification and the integration of Naive Bayes predictive models in Oracle's marketing business application (Oracle Marketing).
Statistics-driven workload modeling for the Cloud
TLDR
This paper uses statistical models to predict resource requirements for Cloud computing applications and presents initial design of a workload generator that can be used to evaluate alternative configurations without the overhead of reproducing a real workload.
SystemML: Declarative machine learning on MapReduce
TLDR
This paper proposes SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment, and it describes and empirically evaluates a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation.
The Volcano optimizer generator: extensibility and efficient search
TLDR
The Volcano project, which provides efficient, extensible tools for query and request processing, particularly for object-oriented and scientific database systems, is reviewed, and it is shown that the search engine of the Volcano optimizer generator is more extensible and powerful.
I/O-efficient statistical computing with RIOT
TLDR
This demo will show how statistical computation can be effectively and efficiently handled by RIOT, a system that extends R, a popular computing environment for statistical data analysis.