Bayesian models are generally computed with Markov Chain Monte Carlo (MCMC) methods. The main disadvantage of MCMC methods is the large number of iterations they need to sample the posterior distributions of model parameters, especially for large datasets. On the other hand, variable selection remains a challenging problem due to its combinatorial search…
Most research on data mining has proposed algorithms and optimizations that work on flat files, outside a DBMS, mainly due to the following reasons. It is easier to develop efficient algorithms in a traditional programming language. The integration of data mining algorithms into a DBMS is difficult given its relational model foundation and system…
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer. In this paper we present the architecture of Orca, the new query…
Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and user-defined functions (UDFs). We concentrate on efficient…
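To make the idea of computing Euclidean distance with SQL and a UDF concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the technique from the paper, whose details are cut off above. The vertical table layout (`x`, `c`), the column names, and the `py_sqrt` UDF are all assumptions chosen for illustration.

```python
import math
import sqlite3


def euclidean_distances(points, centroids):
    """Return {(i, j): distance} between point i and centroid j, computed in SQL.

    Both inputs are stored in a hypothetical vertical layout: one row per
    (row id, dimension index, value).
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical scalar UDF exposing a square root to SQL.
    conn.create_function("py_sqrt", 1, math.sqrt)
    conn.execute("CREATE TABLE x (i INTEGER, h INTEGER, val REAL)")
    conn.execute("CREATE TABLE c (j INTEGER, h INTEGER, val REAL)")
    conn.executemany(
        "INSERT INTO x VALUES (?, ?, ?)",
        [(i, h, v) for i, p in enumerate(points) for h, v in enumerate(p)],
    )
    conn.executemany(
        "INSERT INTO c VALUES (?, ?, ?)",
        [(j, h, v) for j, p in enumerate(centroids) for h, v in enumerate(p)],
    )
    # Join on the dimension index, then sum squared differences per (i, j) pair.
    rows = conn.execute(
        """
        SELECT x.i, c.j, py_sqrt(SUM((x.val - c.val) * (x.val - c.val)))
        FROM x JOIN c ON x.h = c.h
        GROUP BY x.i, c.j
        ORDER BY x.i, c.j
        """
    ).fetchall()
    conn.close()
    return {(i, j): d for i, j, d in rows}
```

For example, `euclidean_distances([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0)])` yields one distance per (point, centroid) pair. The join-plus-aggregation form keeps the whole computation inside the database engine; only the final square root is delegated to the UDF.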
Relational database systems have been the dominant technology for managing and analyzing large data warehouses. Moreover, the ER model, the standard in database design, has a close relationship with the relational model. Recently, there has been a surge of alternative technologies for large-scale analytic processing, most of which are not based on the…
Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well-known model for dimensionality reduction in statistical analysis, which requires a large number of I/O and CPU operations. In this paper, we study how to compute PCA in parallel. We extend a previous sequential method to a highly parallel algorithm…
Information retrieval techniques have been traditionally exploited outside of relational database systems, due to storage overhead, the complexity of programming them inside the database system, and their slow performance in SQL implementations. This project supports the idea that searching and querying digital libraries with information retrieval models in…
Most data mining processing is currently performed on flat files outside the DBMS. We propose novel techniques to process such data mining computations inside the DBMS. We focus on the popular Naive Bayes classification algorithm. In contrast to most approaches, our techniques work completely inside the DBMS, exploiting the DBMS programmability mechanisms…
OLAP is a set of database exploratory techniques to efficiently retrieve multiple sets of aggregations from a large dataset. Generally, these techniques have either involved the use of an external OLAP server or required the dataset to be exported to a specialized OLAP tool for more efficient processing. In this work, we show that OLAP techniques can be…
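As an illustration of what "multiple sets of aggregations" can look like when expressed in plain SQL, the sketch below computes totals at every grouping level of a two-dimensional cube with Python's built-in `sqlite3` module. It is not the method of the paper, whose description is truncated above. The `sales` table, its columns, and the use of `NULL` to mean "all values" are assumptions made for the example.

```python
import sqlite3


def cube_totals(rows):
    """Return {(region, product): total}, with None standing for 'all values'.

    rows is an iterable of (region, product, amount) tuples for a
    hypothetical sales table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # One GROUP BY per aggregation level, combined with UNION ALL:
    # (region, product), (region), (product), and the grand total.
    result = conn.execute(
        """
        SELECT region, product, SUM(amount) FROM sales GROUP BY region, product
        UNION ALL
        SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
        UNION ALL
        SELECT NULL, product, SUM(amount) FROM sales GROUP BY product
        UNION ALL
        SELECT NULL, NULL, SUM(amount) FROM sales
        """
    ).fetchall()
    conn.close()
    return {(region, product): total for region, product, total in result}
```

Using `NULL` as the "all" marker is only safe when the dimension columns themselves contain no `NULL` values; a real schema would likely need a dedicated marker or a separate level column.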