Corpus ID: 13345461

Fast PCA and Bayesian Variable Selection for Large Data Sets Based on SQL and UDFs ∗

Mario Navas, Carlos Ordonez
Large amounts of data are stored in relational DBMSs. However, statistical analysis is frequently performed outside the DBMS with statistical tools such as the well-known R package, which leads to slow processing when data sets cannot fit in main memory and adds a file-export bottleneck. In this article, we propose algorithms for large data set processing of principal component analysis (PCA) and stochastic search variable selection (SSVS) that can work entirely inside a DBMS, using SQL…
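The paper's central idea is that PCA can be computed inside the DBMS from compact summary matrices gathered in a single table scan, rather than from the full data set. A minimal numpy sketch of that pipeline, with the array standing in for the SQL table scan (the variable names `n`, `L`, `Q` follow the sufficient-statistics convention; the rest is an illustrative assumption, not the paper's SQL code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # stands in for a table of n rows, d columns

# One pass over the data: count, linear sum, quadratic sum of cross-products.
n = X.shape[0]
L = X.sum(axis=0)                # d-vector, like SUM(x_i) in SQL
Q = X.T @ X                      # d x d matrix, like SUM(x_i * x_j)

# The correlation matrix follows from the summary alone; no second scan.
mean = L / n
cov = Q / n - np.outer(mean, mean)
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

# PCA reduces to an eigen-decomposition of the tiny d x d matrix.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]
```

The key design point is that only the d x d summary ever leaves the scan, so the eigen-solve is cheap regardless of how many rows the table has.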


Bayesian variable selection for linear regression in high dimensional microarray data
This work presents several algorithmic optimizations that accelerate the MCMC method so it works efficiently inside a database system, and shows that a DBMS is a promising platform to analyze microarray data.


Statistical Model Computation with UDFs
  • C. Ordonez · Computer Science · IEEE Transactions on Knowledge and Data Engineering · 2010
This work introduces techniques to efficiently compute fundamental statistical models inside a DBMS exploiting User-Defined Functions (UDFs), and studies the computation of linear regression, PCA, clustering, and Naive Bayes.
Building statistical models and scoring with UDFs
The techniques described herein are used in a commercial data mining tool called Teradata Warehouse Miner; the paper explains how correlation, linear regression, PCA, and clustering are integrated into the Teradata DBMS.
Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling
An aggregate UDF is presented that computes multidimensional sufficient statistics benefiting a broad array of statistical models: the linear sum of points and the quadratic sum of cross-products of point dimensions.
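An aggregate UDF of this kind typically follows an accumulate/merge contract: one call per row, plus a merge step that lets the DBMS combine partial states from parallel scans. A hypothetical sketch of that contract for the sufficient statistics above (the class and method names are illustrative, not the paper's API):

```python
import numpy as np

class SufficientStats:
    """Hypothetical sketch of an aggregate UDF state for n, L, Q.
    accumulate() is invoked once per row; merge() combines partial
    states produced by independent parallel scans."""
    def __init__(self, d):
        self.n = 0
        self.L = np.zeros(d)
        self.Q = np.zeros((d, d))

    def accumulate(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.L += x
        self.Q += np.outer(x, x)

    def merge(self, other):
        self.n += other.n
        self.L += other.L
        self.Q += other.Q
        return self

# Two "partitions" of the same table, scanned independently, then merged:
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
a, b = SufficientStats(3), SufficientStats(3)
for row in X[:60]:
    a.accumulate(row)
for row in X[60:]:
    b.accumulate(row)
total = a.merge(b)
```

Because sums and cross-product sums are associative, the merged state is identical to a single sequential scan, which is what makes the aggregate safe to parallelize.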
On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases
This work presents a new SQL operator (Unpivot) that enables efficient gathering of statistics with minimal changes to the SQL backend, and shows analytically how this approach outperforms an alternative that requires changes to the data layout.
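The Unpivot operation turns one wide row per record into one (id, attribute, value) row per attribute, so column-wise statistics can be gathered with a single grouped aggregation. A minimal Python sketch of the transformation (the function and column names are illustrative assumptions):

```python
def unpivot(rows, id_col, value_cols):
    """Hypothetical sketch of the Unpivot operator: emit one
    (id, attribute, value) triple per attribute of each wide row."""
    out = []
    for row in rows:
        for col in value_cols:
            out.append((row[id_col], col, row[col]))
    return out

wide = [
    {"id": 1, "age": 34, "income": 51000},
    {"id": 2, "age": 29, "income": 43000},
]
tall = unpivot(wide, "id", ["age", "income"])
# tall == [(1, "age", 34), (1, "income", 51000),
#          (2, "age", 29), (2, "income", 43000)]
```

In the tall form, per-attribute counts, sums, and histograms all become a single GROUP BY on the attribute column, which is why no change to the physical data layout is needed.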
Efficient computation of PCA with SVD in SQL
This work proposes a solution that combines a summarization of the data set with the correlation or covariance matrix and then solves PCA with Singular Value Decomposition (SVD).
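The attraction of this route is that the SVD runs on the small d x d covariance matrix built from the summarization, not on the n x d data. A numpy sketch under that assumption (the data and mixing matrix are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data with deliberately unequal variances per dimension.
X = rng.normal(size=(500, 3)) @ np.diag([2.0, 1.0, 0.1])

# Summarization pass: the same n, L, Q statistics as elsewhere in the paper.
n, L, Q = X.shape[0], X.sum(axis=0), X.T @ X
cov = Q / n - np.outer(L / n, L / n)

# SVD of the symmetric covariance matrix: U holds the principal
# directions, S the variances along them, sorted in decreasing order.
U, S, Vt = np.linalg.svd(cov)
explained = S / S.sum()
```

For a symmetric positive semi-definite matrix the SVD coincides with the eigen-decomposition, so the singular values are exactly the component variances.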
ATLAS: A Small but Complete SQL Extension for Data Mining and Data Streams
On parallel processing of aggregate and scalar functions in object-relational DBMS
A framework for processing user-defined functions with data parallelism: it describes the class of partitionable functions that can be processed in parallel, and proposes an extension that speeds up another large class of functions by means of parallel sorting.
Map-reduce-merge: simplified relational data processing on large clusters
A Merge phase is added to Map-Reduce that can efficiently merge data already partitioned and sorted by the map and reduce modules; it is demonstrated that this new model can express relational algebra operators as well as implement several join algorithms.
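The model's point is that two data sets are mapped and reduced independently, and a final Merge phase joins the two sorted reduce outputs in one coordinated pass. A toy sketch of a sort-merge equi-join in that style (data, keys, and function names are all illustrative assumptions):

```python
from collections import defaultdict

def map_reduce(records, key_fn, reduce_fn):
    """Map: emit (key, record). Reduce: one output per key,
    sorted by key so the merge phase can walk it sequentially."""
    groups = defaultdict(list)
    for r in records:
        groups[key_fn(r)].append(r)
    return sorted((k, reduce_fn(vs)) for k, vs in groups.items())

emps = [("alice", 10), ("bob", 20), ("carol", 10)]     # (name, dept_id)
depts = [(10, "sales"), (20, "ops"), (30, "hr")]       # (dept_id, name)

emp_out = map_reduce(emps, key_fn=lambda r: r[1],
                     reduce_fn=lambda vs: [v[0] for v in vs])
dept_out = map_reduce(depts, key_fn=lambda r: r[0],
                      reduce_fn=lambda vs: vs[0][1])

def merge(left, right):
    """Merge phase: advance through both sorted reduce outputs once."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = left[i][0], right[j][0]
        if kl == kr:
            for name in left[i][1]:
                out.append((name, right[j][1]))
            i += 1; j += 1
        elif kl < kr:
            i += 1
        else:
            j += 1
    return out

joined = merge(emp_out, dept_out)
# joined == [("alice", "sales"), ("carol", "sales"), ("bob", "ops")]
```

Because both inputs arrive pre-partitioned and pre-sorted, the merge is a single linear scan rather than a nested loop, which is what makes the join efficient at cluster scale.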
.NET database programmability and extensibility in microsoft SQL server
This work describes the extensibility contracts for user-defined types and aggregates in detail and presents the advances to the CLR integration in SQL Server 2008, which significantly broaden the breadth of applications supported by SQL Server.
User-defined aggregate functions: bridging theory and practice
This paper considers query optimization, query rewriting and view maintenance for queries with UDAs, and presents theoretical and practical insights that can be combined to derive a coherent framework for defining UDAs within a database system.
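UDAs in such frameworks typically expose an Init/Accumulate/Merge/Terminate contract, and it is the mergeable intermediate state that makes incremental view maintenance possible: a stored state can absorb a delta batch without rescanning the base table. A hypothetical sketch of that idea for a simple average (the class and the view-maintenance scenario are illustrative assumptions, not this paper's system):

```python
class AvgUDA:
    """Hypothetical user-defined aggregate following the common
    Init / Accumulate / Merge / Terminate contract."""
    def __init__(self):            # Init: empty state
        self.count = 0
        self.total = 0.0

    def accumulate(self, value):   # Accumulate: one call per row
        self.count += 1
        self.total += value

    def merge(self, other):        # Merge: combine partial states
        self.count += other.count
        self.total += other.total
        return self

    def terminate(self):           # Terminate: produce the final value
        return self.total / self.count if self.count else None

# View maintenance: keep the state, fold in a delta batch on refresh.
view_state = AvgUDA()
for v in [10.0, 20.0, 30.0]:
    view_state.accumulate(v)

delta = AvgUDA()                   # rows inserted since the last refresh
delta.accumulate(40.0)

view_state.merge(delta)
# view_state.terminate() == 25.0
```

Note that storing (count, total) rather than the finished average is what makes the refresh exact; the final division happens only in Terminate.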