SciDB is a parallel array DBMS that provides multidimensional arrays, a query language, and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a large family of statistical and machine learning models, including…

Data summarization is an essential mechanism to accelerate analytic algorithms on large data sets. On the other hand, array DBMSs enable scalable computation with large matrices. With that motivation in mind, we propose a parallel array operator, based on a specific form of matrix multiplication, that computes a comprehensive data summarization matrix. By…

We study the serial and parallel computation of Γ (Gamma), a comprehensive data summarization matrix for linear Gaussian models, which are widely used in big data analytics. Computing Gamma can be reduced to a single matrix multiplication with the data set, and that multiplication can be evaluated as a sum of vector outer products, which enables incremental and…
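The sum-of-outer-products evaluation described above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each record is augmented with a leading 1 before its outer product is accumulated, so that the resulting matrix holds the record count, the per-column sums, and the cross-product sums. The function names `gamma_update`, `gamma_one_pass`, and `gamma_merge` are hypothetical and are not taken from the papers or from SciDB.

```python
from typing import Iterable, List, Sequence

Matrix = List[List[float]]


def gamma_update(gamma: Matrix, z: Sequence[float]) -> None:
    """Add the outer product z z^T to gamma in place (one record)."""
    for a, za in enumerate(z):
        row = gamma[a]
        for b, zb in enumerate(z):
            row[b] += za * zb


def gamma_one_pass(rows: Iterable[Sequence[float]], width: int) -> Matrix:
    """Scan the records once, accumulating one outer product per record.

    Assumption: each record is augmented as z = (1, record...), so
    gamma[0][0] is the count and gamma[0][1:] holds the column sums.
    """
    gamma = [[0.0] * (width + 1) for _ in range(width + 1)]
    for row in rows:
        gamma_update(gamma, [1.0, *row])
    return gamma


def gamma_merge(a: Matrix, b: Matrix) -> Matrix:
    """Combine partial results from two disjoint partitions by addition."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


if __name__ == "__main__":
    data = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
    full = gamma_one_pass(data, 2)
    # Simulate two workers, each summarizing its own partition.
    merged = gamma_merge(gamma_one_pass(data[:2], 2), gamma_one_pass(data[2:], 2))
    print(full)
    print(merged == full)
```

Because each record contributes an independent additive term, partial matrices computed on disjoint partitions can simply be added together, which is what makes a one-pass, parallel evaluation possible in this sketch.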

