We consider processing an n × d matrix A in a stream with row-wise updates according to a recent algorithm called Frequent Directions (Liberty, KDD 2013). This algorithm deterministically maintains an ℓ × d matrix Q, processing each row in O(dℓ²) time; the processing time can be decreased to O(dℓ) with a slight modification of the algorithm and a constant …
We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-update model. The algorithm is presented with an arbitrary input matrix A ∈ R^{n×d} one row at a time. It performs O(dℓ) operations per row and maintains a sketch matrix B ∈ R^{ℓ×d} such that for any k < ℓ, ‖AᵀA − BᵀB‖₂ ≤ ‖A − A_k‖²_F / (ℓ − k) and ‖A − π_{B_k}(A)‖²_F ≤ 1…
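The two snippets above describe the same streaming primitive. A minimal NumPy sketch of the basic Frequent Directions loop is shown below, assuming ℓ ≤ d; this is an illustrative reimplementation of the published description (insert each row into a zero row of B, and when B is full, shrink all squared singular values by the smallest one), not the authors' code, and it runs the simpler O(dℓ²)-per-row variant rather than the amortized O(dℓ) one:

```python
import numpy as np

def frequent_directions(A, ell):
    """Sketch the rows of A (n x d) into B (ell x d), one row at a time.

    Assumes ell <= d. Guarantees ||A^T A - B^T B||_2 <= ||A||_F^2 / ell.
    """
    n, d = A.shape
    B = np.zeros((ell, d))
    for row in A:
        # Find a zero row of B to hold the incoming row; shrink first if none.
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[-1] ** 2                       # smallest squared singular value
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt                      # last row of B is now all zeros
            zero_rows = [ell - 1]
        B[zero_rows[0]] = row
    return B
```

Each shrink step subtracts at most δ·I from BᵀB, and the total shrinkage is bounded by ‖A‖²_F / ℓ, which is where the deterministic error guarantee comes from.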
Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of "tracking approximations to a matrix" in the distributed streaming model. In this model, there are m …
Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This …
Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture nonlinear structures within large data sets, and is a central tool in data analysis and learning. To allow for nonlinear relations, typically a full n × n kernel matrix is constructed over n data points, but this requires too much space and time for large values …
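To make the n × n bottleneck concrete, here is a naive KPCA sketch that materializes the full kernel matrix; the RBF kernel, the `gamma` parameter, and the function name are assumptions for illustration, not details from the abstract. Its memory footprint grows quadratically in n, which is exactly what the sketching approaches above try to avoid:

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma=1.0):
    """Naive kernel PCA with an RBF kernel: builds the full n x n matrix K."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2); an n x n dense matrix.
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    # Center the kernel matrix in feature space.
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]      # top eigenpairs
    # Projections of the n points onto the top principal components.
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

For n in the millions, K alone is infeasible to store, which motivates streaming and low-memory kernel sketches.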
This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix A ∈ R^{n×d} one by one in the streaming setting and compute a small sketch B ∈ R^{ℓ × …
Big data is becoming ever more ubiquitous, ranging over massive video repositories, document corpuses, image sets and Internet routing history. Proximity search and clustering are two algorithmic primitives fundamental to data analysis, but suffer from the "curse of dimensionality" on these gigantic datasets. A popular attack for this problem is to …
In order to define a distributed matrix sketching problem thoroughly, one has to specify the distributed model, the data model, and the partition model of the data. The distributed model is often considered as a set of m distributed sites {S₁, S₂, …, S_m} and a central coordinator site C, where each site has a two-way communication channel with C. Note this …
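The coordinator model just described can be simulated in a few lines. The sketch below is a hypothetical illustration only: the batching threshold and the choice to ship Gram-matrix updates to C are placeholders, not the protocol of any particular paper; the point is the topology, in which sites talk only to the coordinator, never to each other:

```python
import numpy as np

class Coordinator:
    """Central site C: aggregates updates sent by the m distributed sites."""
    def __init__(self, d):
        self.total = np.zeros((d, d))      # running approximation of A^T A

    def receive(self, site_id, gram_update):
        self.total += gram_update

class Site:
    """One of the sites S_1..S_m: streams rows and communicates only with C."""
    def __init__(self, site_id, d, coordinator, batch=10):
        self.id, self.coord, self.batch = site_id, coordinator, batch
        self.buffer = []

    def observe(self, row):
        self.buffer.append(row)
        if len(self.buffer) == self.batch:  # hypothetical communication trigger
            M = np.array(self.buffer)
            self.coord.receive(self.id, M.T @ M)  # ship a local Gram update
            self.buffer = []
```

Because the Gram matrix of stacked rows is the sum of the per-batch Gram matrices, the coordinator's total equals AᵀA over all flushed rows, regardless of how the stream is partitioned across sites.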
Most matrix approximation techniques (including SVD) provide the basis vectors as linear combinations of data features; e.g., in a term-document matrix a basis vector could be [(3/2) job − (2/7) society + · · · + (1/√10) salary]. Typically, these vectors are neither understandable nor particularly informative. In addition, analysts spend a vast amount of time …
All low-rank matrix approximation algorithms, including fundamental ones such as the power method or orthogonal iteration, involve many matrix-matrix or matrix-vector multiplications. These basic operations require time proportional to the number of nonzero entries in the matrices, as one needs to read the entire matrix into memory. Sparsifying a matrix, i.e. …
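To see why the cost is proportional to the number of nonzeros, consider a matrix-vector product over a matrix stored in compressed sparse row (CSR) form; the inner loop touches each stored entry exactly once, so fewer nonzeros means proportionally less work. This is a standard textbook sketch, not code from any of the papers above:

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """y = A @ x for A in CSR form: data holds the nonzero values,
    indices their column positions, and indptr the row boundaries."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Only the nonzeros of row i are visited; total work is O(nnz).
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y
```

For example, the matrix [[1, 0, 2], [0, 3, 0]] is stored as data = [1, 2, 3], indices = [0, 2, 1], indptr = [0, 2, 3], and multiplying it by [1, 1, 1] touches only its three nonzeros.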