Materialization Optimizations for Feature Selection Workloads
@article{Zhang2016MaterializationOF, title={Materialization Optimizations for Feature Selection Workloads}, author={Ce Zhang and Arun Kumar and Christopher R{\'e}}, journal={ACM Trans. Database Syst.}, year={2016}, volume={41}, pages={2:1-2:32} }
There is an arms race in the data management industry to support statistical analytics. Feature selection, the process of selecting a feature set that will be used to build a statistical model, is widely regarded as the most critical step of statistical analytics. Thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature selection language and a supporting prototype system that builds on top of current…
142 Citations
Processing Analytical Workloads Incrementally
- Computer Science, ArXiv
- 2015
This paper introduces model materialization and incremental model reuse as first-class citizens in the execution of analysis workloads. It details how to maintain models incrementally and outlines the optimizations required to make the best use of models and their incremental adjustments when building new ones.
To Join or Not to Join?: Thinking Twice about Joins before Feature Selection
- Computer Science, SIGMOD Conference
- 2016
This work identifies the core technical issue that could cause accuracy to decrease in some cases and analyzes this issue theoretically to design easy-to-understand decision rules to predict when it is safe to avoid joins, which led to significant reductions in the runtime of some popular feature selection methods.
Learning Generalized Linear Models Over Normalized Data
- Computer Science, SIGMOD Conference
- 2015
A new approach named factorized learning is introduced that pushes ML computations through joins and avoids redundancy in both I/O and computations. It is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach.
Automatic Database Management System Tuning Through Large-scale Machine Learning
- Computer Science, SIGMOD Conference
- 2017
An automated approach that leverages past experience and collects new information to tune DBMS configurations and recommends configurations that are as good as or better than ones generated by existing tools or a human expert is presented.
Demonstration of Santoku: Optimizing Machine Learning over Normalized Data
- Computer Science, Proc. VLDB Endow.
- 2015
Santoku applies the idea of factorized learning and automatically decides whether to denormalize or push ML computations through joins, and exploits database dependencies to provide automatic insights that could help analysts with exploratory feature selection.
Enforcing Constraints for Machine Learning Systems via Declarative Feature Selection: An Experimental Study
- Computer Science, SIGMOD Conference
- 2021
This work proposes Declarative Feature Selection (DFS) to simplify the design and validation of ML systems satisfying diverse user-specified constraints and shows that a meta-learning-driven optimizer can accurately predict the right strategy for an ML task at hand.
A Relational Framework for Classifier Engineering
- Computer Science, TODS
- 2018
A formal framework for classification in the context of a relational database is proposed to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database’s modeling and understanding of data and queries and by deploying the well-studied principles of database management.
The optimization for recurring queries in big data analysis system with MapReduce
- Computer Science, Future Gener. Comput. Syst.
- 2018
Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models
- Computer Science, Proc. VLDB Endow.
- 2017
Experimental results on real and synthetic datasets show that the proposed framework achieves improved convergence over HOGWILD! and is the only solution scalable to massive models.
BlockJoin: Efficient Matrix Partitioning Through Joins
- Computer Science, Proc. VLDB Endow.
- 2017
BlockJoin is presented, a distributed join algorithm which directly produces block-partitioned results and applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.
References
Towards a unified architecture for in-RDBMS analytics
- Computer Science, SIGMOD Conference
- 2012
This work proposes a unified architecture for in-database analytics that requires changes to only a few dozen lines of code to integrate a new statistical technique, and demonstrates the feasibility of this architecture by integrating several popular analytics techniques into two commercial and one open-source RDBMS.
The MADlib Analytics Library or MAD Skills, the SQL
- Computer Science, Proc. VLDB Endow.
- 2012
The MADlib project is introduced, including the background that led to its beginnings and the motivation for its open-source nature. An overview of the library's architecture and design patterns is provided, along with a description of various statistical methods in that context.
Learning Generalized Linear Models Over Normalized Data
- Computer Science, SIGMOD Conference
- 2015
A new approach named factorized learning is introduced that pushes ML computations through joins and avoids redundancy in both I/O and computations. It is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach.
Materialization Optimizations for Feature Selection Workloads
- Computer Science
- 2016
Feature selection, the process of selecting a feature set that will be used to build a statistical model, is the next frontier in data management.
MAD Skills: New Analysis Practices for Big Data
- Computer Science, Proc. VLDB Endow.
- 2009
This paper highlights the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence, and describes database design methodologies that support the agile working style of analysts in these settings.
Oracle Data Mining
- Computer Science
- 2005
The functionality and algorithms behind ODM are described, along with two examples of its use: an SVM methodology for tumor classification and the integration of Naive Bayes predictive models in Oracle’s marketing business application (Oracle Marketing).
Statistics-driven workload modeling for the Cloud
- Computer Science, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010)
- 2010
This paper uses statistical models to predict resource requirements for Cloud computing applications and presents initial design of a workload generator that can be used to evaluate alternative configurations without the overhead of reproducing a real workload.
SystemML: Declarative machine learning on MapReduce
- Computer Science, 2011 IEEE 27th International Conference on Data Engineering
- 2011
This paper proposes SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment, and describes and empirically evaluates a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation.
The Volcano optimizer generator: extensibility and efficient search
- Computer Science, Proceedings of IEEE 9th International Conference on Data Engineering
- 1993
The Volcano project, which provides efficient, extensible tools for query and request processing, particularly for object-oriented and scientific database systems, is reviewed, and it is shown that the search engine of the Volcano optimizer generator is more extensible and powerful.
I/O-efficient statistical computing with RIOT
- Computer Science, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010)
- 2010
This demo will show how statistical computation can be effectively and efficiently handled by RIOT, a system that extends R, a popular computing environment for statistical data analysis.