Matrix Sketching Over Sliding Windows

@article{Wei2016MatrixSO,
  title={Matrix Sketching Over Sliding Windows},
  author={Zhewei Wei and Xuancheng Liu and Feifei Li and Shuo Shang and Xiaoyong Du and Ji-Rong Wen},
  journal={Proceedings of the 2016 International Conference on Management of Data},
  year={2016}
}
Large-scale matrix computation has become essential for many data applications, and hence the problem of sketching a matrix with small space and high precision has received extensive study in the past few years. This problem is often considered in the row-update streaming model, where the data set is a matrix $A \in \mathbb{R}^{n \times d}$, and the processor receives a row ($1 \times d$) of $A$ at each timestamp. The goal is to maintain a smaller matrix (termed approximation matrix, or simply approximation) $B \in \mathbb{R}^{\ell \times d}$…
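To make the row-update streaming model concrete, here is a minimal Python sketch of Frequent Directions, a well-known deterministic sketching algorithm for the full (non-windowed) stream. It is an illustration only and not the sliding-window method of this paper: it never expires old rows. The function names, the `2 * ell` buffer size, and the covariance-error check $\|A^T A - B^T B\|_2 \le \|A\|_F^2 / \ell$ are assumptions chosen for the example, since the abstract above is cut off before it states the error measure.

```python
import numpy as np

def _shrink(B, ell):
    """Shrink step: subtract the (ell+1)-th squared singular value from all
    squared singular values, so that only the first ell rows stay non-zero."""
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    delta = s[ell] ** 2 if len(s) > ell else 0.0
    s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    out = np.zeros_like(B)
    out[:len(s)] = s_shrunk[:, None] * Vt
    return out

def frequent_directions(rows, ell, d):
    """Consume rows of length d one at a time; return an ell x d sketch B.
    Hypothetical helper for illustration (whole-stream model, no expiry)."""
    B = np.zeros((2 * ell, d))   # working buffer with 2*ell rows (assumption)
    next_free = 0                # index of the next empty buffer row
    for row in rows:
        if next_free == 2 * ell:     # buffer full: shrink to free ell rows
            B = _shrink(B, ell)
            next_free = ell
        B[next_free] = row
        next_free += 1
    return _shrink(B, ell)[:ell]     # final shrink leaves at most ell rows

# Usage: stream the rows of a random 200 x 10 matrix into a 5 x 10 sketch.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
B = frequent_directions(A, ell=5, d=10)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)    # spectral norm of the difference
bound = np.linalg.norm(A, "fro") ** 2 / 5     # ||A||_F^2 / ell
```

Each row is touched once and only the `2 * ell` buffer rows are stored, which is the space pattern the streaming model asks for. Handling a sliding window additionally requires forgetting rows that fall out of the window, which this sketch does not attempt.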


Tracking Matrix Approximation over Distributed Sliding Windows
TLDR
This paper proposes sampling-based algorithms that continuously track a weighted sample of rows according to their squared norms, which generalize and simplify the sampling techniques in [2], and deterministic tracking algorithms that require only one-way communication and provide better error guarantee.
Efficient Matrix Sketching over Distributed Data
TLDR
This paper considers the problem of computing a sketch of a massive data matrix $A \in \mathbb{R}^{n \times d}$, which is distributed across a large number $s$ of servers, and gives a new algorithm for distributed PCA with improved communication cost.
Near Optimal Linear Algebra in the Online and Sliding Window Models
TLDR
A unified row-sampling based framework that gives randomized algorithms for spectral approximation, low-rank approximation/projection-cost preservation, and $\ell_{1}$-subspace embeddings in the sliding window model, which often use nearly optimal space and achieve nearly input sparsity runtime.
Matrix Norms in Data Streams: Faster, Multi-Pass and Row-Order
TLDR
This paper studies several previously unconsidered aspects of estimating matrix norms in a stream, and obtains a near-complete characterization of the memory required of row-order algorithms for estimating Schatten norms of sparse matrices.
Smoothness of Schatten Norms and Sliding-Window Matrix Streams
Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA
TLDR
This paper proves an almost tight deterministic communication lower bound, then provides a new randomized algorithm with communication cost smaller than the deterministic lower bound and gives an improved distributed PCA algorithm for sparse input matrices, which uses the distributed sketching algorithm as a key building block.
Truly Perfect Samplers for Data Streams and Sliding Windows
TLDR
This work shows that sublinear space truly perfect sampling is impossible in the turnstile model, and proves a lower bound of Ω(min(n, log 1/γ)) for any G-sampler with point-wise error γ from the true distribution, and gives a general time-efficient sublinear-space framework for developing truly perfect samplers in the insertion-only streaming and sliding window models.
Symmetric Norm Estimation and Regression on Sliding Windows
TLDR
This work observes that the symmetric norm streaming algorithm of Braverman et al. (STOC 2017) can be reduced to identifying and approximating the frequency of heavy-hitters in a number of substreams, and introduces a heavy-hitter algorithm that gives a (1 + ε)-approximation to each of the reported frequencies in the sliding window model.
Sketches for Matrix Norms: Faster, Smaller and More General
TLDR
It is proved that one can obtain an approximation to $l(A)$ from a sketch $GAH^T$ where $G$ and $H$ are independent Oblivious Subspace Embeddings and the dimension of the sketch is polynomial in the intrinsic dimension of $A$.
Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators
TLDR
The results show there is no separation between the sliding window model and the standard data stream model in terms of the approximation factor, and the first difference estimators for a wide range of problems are developed.

References

Continuous Matrix Approximation on Distributed Data
TLDR
Novel algorithms to address the matrix approximation problem of "tracking approximations to a matrix" in the distributed streaming model are presented and extensive experiments with real large datasets demonstrate the efficiency of these protocols.
Sketching distributed sliding-window data streams
TLDR
This work introduces a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees and is the first work to address efficient, guaranteed-error complex query answering over distributed data streams in the sliding-window model.
Sampling time-based sliding windows in bounded space
TLDR
This paper focuses on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency and it is proved that it is impossible to guarantee a minimum sample size in bounded space.
Sampling from a moving window over streaming data
TLDR
This work introduces the problem of sampling from a moving window of recent items from a data stream and develops two algorithms, the first of which, "chain-sample", extends reservoir sampling to deal with the expiration of data elements from the sample and the second, "priority-sample", works even when the number of elements in the window can vary dynamically over time.
Maintaining Stream Statistics over Sliding Windows
TLDR
The problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far, is considered, and it is shown that, using $O(\frac{1}{\epsilon} \log^2 N)$ bits of memory, the number of 1's can be estimated to within a factor of $1 + \epsilon$.
Maintaining sliding window skylines on data streams
TLDR
This paper proposes algorithms that continuously monitor the incoming data and maintain the skyline incrementally, and utilizes several interesting properties of stream skylines to improve space/time efficiency by expunging data from the system as early as possible (i.e., before their expiration).
Continuous sampling from distributed streams
TLDR
This article presents communication-efficient protocols for continuously maintaining a sample (both with and without replacement) from k distributed streams, and shows that these protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.
Improved Practical Matrix Sketching with Guarantees
TLDR
This paper attempts to categorize and compare the most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees.
Relative Errors for Deterministic Low-Rank Matrix Approximations
TLDR
It is shown that Frequent Directions cannot be adapted to a sparse version in an obvious way that retains the l original rows of the matrix, as opposed to a linear combination or sketch of the rows.
Improved Approximation Algorithms for Large Matrices via Random Projections
  • Tamás Sarlós
  • Computer Science
    2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06)
  • 2006
TLDR
The key idea is that low dimensional embeddings can be used to eliminate data dependence and provide more versatile, linear time pass efficient matrix computation.