Mina Ghashami

We consider processing an n × d matrix A in a stream with row-wise updates according to a recent algorithm called Frequent Directions (Liberty, KDD 2013). This algorithm maintains an ℓ × d matrix Q deterministically, processing each row in O(dℓ²) time; the processing time can be decreased to O(dℓ) with a slight modification in the algorithm and a constant …
We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented an arbitrary input matrix A ∈ ℝ^{n×d} one row at a time. It performs O(dℓ) operations per row and maintains a sketch matrix B ∈ ℝ^{ℓ×d} such that for any k < ℓ, ‖AᵀA − BᵀB‖₂ ≤ ‖A − A_k‖²_F / (ℓ − k) and ‖A − π_{B_k}(A)‖²_F ≤ (1 + k/(ℓ − k)) ‖A − A_k‖²_F …
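To make the guarantee above concrete, here is a minimal sketch of the Frequent Directions idea in Python/NumPy. The variable names (ell, B) and the "zero out half the rows by shrinking singular values" step follow the common textbook presentation of the algorithm, not any specific reference implementation, so treat it as an illustrative sketch rather than the authors' code.

    import numpy as np

    def frequent_directions(A, ell):
        """Maintain an ell x d sketch B of the rows of A seen so far
        (a minimal sketch of the Frequent Directions idea)."""
        n, d = A.shape
        B = np.zeros((ell, d))
        for row in A:  # rows arrive one at a time, as in the streaming model
            zero_rows = np.where(~B.any(axis=1))[0]
            if len(zero_rows) == 0:
                # Shrink step: subtract the (ell/2)-th squared singular value,
                # which zeroes out at least half of the rows of B.
                _, s, Vt = np.linalg.svd(B, full_matrices=False)
                delta = s[ell // 2] ** 2
                s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
                B = np.diag(s_shrunk) @ Vt
                zero_rows = np.where(~B.any(axis=1))[0]
            B[zero_rows[0]] = row  # place the new row in a free (zero) row of B
        return B

    # Illustrative usage on random data (hypothetical sizes).
    A = np.random.randn(1000, 50)
    B = frequent_directions(A, ell=20)

Under the stated bound, choosing ℓ ≈ k + k/ε makes 1 + k/(ℓ − k) ≤ 1 + ε, which is how the relative-error reading of the guarantee is usually obtained.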
Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This …
Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of “tracking approximations to a matrix” in the distributed streaming model. In this model, there are m …
Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture nonlinear structures within large data sets, and is a central tool in data analysis and learning. To allow for nonlinear relations, typically a full n × n kernel matrix is constructed over n data points, but this requires too much space and time for large values …
Big data is becoming ever more ubiquitous, ranging over massive video repositories, document corpuses, image sets, and Internet routing history. Proximity search and clustering are two algorithmic primitives fundamental to data analysis, but suffer from the “curse of dimensionality” on these gigantic datasets. A popular attack for this problem is to convert …
Document indexing using dimension reduction has been widely studied in recent years. Application of these methods in large distributed systems may be inefficient due to the required computational, storage, and communication costs. In this paper, we propose DLPR, a distributed locality preserving dimension reduction algorithm, to project a large distributed …