We consider processing an n × d matrix A in a stream with row-wise updates according to a recent algorithm called Frequent Directions (Liberty, KDD 2013). This algorithm maintains an ℓ × d matrix Q deterministically, processing each row in O(dℓ²) time; the processing time can be decreased to O(dℓ) with a slight modification to the algorithm and a constant…
We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented with an arbitrary input matrix $A \in \mathbb{R}^{n \times d}$ one row at a time. It performs $O(d\ell)$ operations per row and maintains a sketch matrix $B \in \mathbb{R}^{\ell \times d}$ such that for any $k < \ell$, $\|A^T A - B^T B\|_2 \le \|A - A_k\|_F^2 / (\ell - k)$ and $\|A - \pi_{B_k}(A)\|_F^2 \le$ …
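As an illustration of the row-update model described in this abstract, here is a minimal NumPy sketch of the simplest form of Frequent Directions. It is not the authors' implementation: the function name `frequent_directions` and the parameter `ell` (the sketch size ℓ) are chosen here for illustration, and the code assumes ℓ ≤ d. It runs a full SVD after every row, which costs O(dℓ²) per row; the O(dℓ) per-row figure quoted in the abstract relies on a buffered variant that is not shown.

```python
import numpy as np


def frequent_directions(A, ell):
    """Return an ell x d sketch B of the n x d matrix A, reading one row at a time.

    Illustrative version only: one SVD per incoming row, assuming 1 <= ell <= d.
    """
    A = np.asarray(A, dtype=float)
    n, d = A.shape
    if not 1 <= ell <= d:
        raise ValueError("this sketch assumes 1 <= ell <= d")

    B = np.zeros((ell, d))
    for row in A:
        # The last row of B is zero at this point (initially, and after each shrink).
        B[-1] = row
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        # Subtract the smallest squared singular value from all of them,
        # which zeroes out the last direction and frees a row for the next update.
        delta = s[-1] ** 2
        shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        B = shrunk[:, None] * Vt
    return B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 10))
    B = frequent_directions(A, ell=5)
    err = np.linalg.norm(A.T @ A - B.T @ B, 2)
    print("sketch shape:", B.shape)
    print("covariance error (spectral norm):", err)
```

The covariance error printed at the end can be compared against the tail energy of A divided by (ℓ − k), computed from the singular values of A, to see the first guarantee stated in the abstract hold on a small example.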
Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This…
This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an n × d input matrix A one by one in the streaming setting and compute a small sketch B ∈ R^{ℓ × …}
Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of "tracking approximations to a matrix" in the distributed streaming model. In this model, there are m…
Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture nonlinear structures within large data sets, and is a central tool in data analysis and learning. To allow for nonlinear relations, typically a full n × n kernel matrix is constructed over n data points, but this requires too much space and time for large values…
Big data is becoming ever more ubiquitous, ranging over massive video repositories, document corpora, image sets and Internet routing history. Proximity search and clustering are two algorithmic primitives fundamental to data analysis, but they suffer from the "curse of dimensionality" on these gigantic datasets. A popular attack for this problem is to…
Document indexing using dimension reduction has been widely studied in recent years. Application of these methods in large distributed systems may be inefficient due to the required computational, storage, and communication costs. In this paper, we propose DLPR, a distributed locality preserving dimension reduction algorithm, to project a large distributed…