# Optimal sampling from sliding windows

@article{Braverman2012OptimalSF, title={Optimal sampling from sliding windows}, author={Vladimir Braverman and Rafail Ostrovsky and Carlo Zaniolo}, journal={J. Comput. Syst. Sci.}, year={2012}, volume={78}, pages={260-272} }

A sliding windows model is an important case of the streaming model, where only the most "recent" elements remain active and the rest are discarded in a stream. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of the sliding windows model -- windows with fixed size, (e.g., where items arrive one at a time, and only the most recent n…

## 73 Citations

Stratified random sampling from streaming and stored data

- Computer Science
- Distributed Parallel Databases
- 2021

It is proved that any sliding window-based streaming SRS needs a workspace of $\Omega(rM\log W)$ in the worst case to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window.

Better Sliding Window Algorithms to Maximize Subadditive and Diversity Objectives

- Computer Science
- PODS
- 2019

This work describes an alternative approach to designing efficient sliding window algorithms for maximization problems, and instantiates this approach on a wide range of problems, yielding better algorithms for submodular function optimization, diversity optimization and general subadditive optimization.

A Unified Approach for Clustering Problems on Sliding Windows

- Computer Science
- ArXiv
- 2015

A data structure that extends smooth histograms as introduced by Braverman and Ostrovsky to operate on a broader class of functions is introduced, and it is shown that using only polylogarithmic space the authors can maintain a summary of the current window from which they can construct an O(1)-approximate clustering solution.

Continuous sampling from distributed streams

- Computer Science
- JACM
- 2012

This article presents communication-efficient protocols for continuously maintaining a sample (both with and without replacement) from k distributed streams, and shows that these protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.

Design of a Sliding Window over Distributed and Asynchronous Event Streams

- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2014

It is proved that the snapshots of the asynchronous event streams within the sliding windows form a convex distributive lattice (denoted by Lat-Win), which makes it easy to integrate existing predicate specification and detection techniques to express and monitor properties of concern over asynchronous event streams.

Optimal sampling from distributed streams

- Computer Science
- PODS '10
- 2010

This paper presents communication-efficient protocols for sampling (both with and without replacement) from k distributed streams, and shows that they use minimal or near minimal time to process each new item, and space to operate.

Sliding window order statistics in sublinear space

- Computer Science
- ArXiv
- 2018

It is proved that the majority statistic on boolean streams cannot be computed in sublinear space, implying that $l^\text{th}$-smallest elements cannot be computed in space both sublinear in $N$ and independent of $l$.

Stream sampling over windows with worst-case optimality and $\ell$-overlap independence

- Computer Science
- The VLDB Journal
- 2017

This paper proposes a new sampling algorithm that is optimal simultaneously in all the three aspects: space, query time, and update time; it handles an update in O(1) worst-case time with a very small hidden constant.

Symmetric Norm Estimation and Regression on Sliding Windows

- Computer Science, Mathematics
- COCOON
- 2021

This work observes that the symmetric norm streaming algorithm of Braverman et al. (STOC 2017) can be reduced to identifying and approximating the frequency of heavy-hitters in a number of substreams, and introduces a heavy-hitter algorithm that gives a (1 + ε)-approximation to each of the reported frequencies in the sliding window model.

A Survey of Real-Time Big Data Processing Algorithms

- Computer Science
- 2020

This study introduces a hybrid window mechanism that handles the most recent data with a sliding window and variable-rate data streams with a tumbling window.

## References

Sampling from a moving window over streaming data

- Computer Science
- SODA '02
- 2002

This work introduces the problem of sampling from a moving window of recent items from a data stream and develops two algorithms. The first, "chain-sample", extends reservoir sampling to deal with the expiration of data elements from the sample; the second, "priority-sample", works even when the number of elements in the window can vary dynamically over time.
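
The priority-sampling idea named in this entry can be illustrated with a short sketch. It is not the authors' code: the class name `PrioritySampler`, its methods, and the convention that an element with timestamp `ts` is active while `now - window < ts <= now` are all assumptions made for illustration. Each arriving element gets an independent random priority, and the sample is the active element with the highest priority. Only elements that no later arrival has beaten need to be stored.

```python
import random
from collections import deque


class PrioritySampler:
    """Keep one uniform random sample over a time-based sliding window.

    Hypothetical illustration: timestamps are assumed to be non-decreasing,
    and an element is active while now - window < timestamp <= now.
    """

    def __init__(self, window, rng=None):
        self.window = window
        self.rng = rng or random.Random()
        # Entries are (timestamp, priority, item). Priorities strictly
        # decrease from front to back, so the front is the current maximum.
        self.candidates = deque()

    def _expire(self, now):
        # Drop elements that have fallen out of the window.
        while self.candidates and self.candidates[0][0] <= now - self.window:
            self.candidates.popleft()

    def add(self, timestamp, item):
        priority = self.rng.random()
        # An older element with a lower (or equal) priority expires first
        # and can never become the maximum again, so it is discarded.
        while self.candidates and self.candidates[-1][1] <= priority:
            self.candidates.pop()
        self.candidates.append((timestamp, priority, item))
        self._expire(timestamp)

    def sample(self, now):
        """Return the active element with the highest priority, or None."""
        self._expire(now)
        return self.candidates[0][2] if self.candidates else None
```

A caller would invoke `add(timestamp, item)` for each arrival and `sample(now)` whenever a sample is needed. Because priorities are independent and identically distributed, each active element is equally likely to hold the maximum, which is what makes the returned element a uniform sample in this sketch. The number of stored candidates is random rather than bounded in the worst case.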

Sampling time-based sliding windows in bounded space

- Computer Science
- SIGMOD Conference
- 2008

This paper focuses on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. It is proved that it is impossible to guarantee a minimum sample size in bounded space.

Distributed streams algorithms for sliding windows

- Computer Science
- SPAA '02
- 2002

This work presents algorithms for estimating aggregate functions over a “sliding window” of the most recent data items in one or more streams, including the first ε-approximation scheme for the number of 1’s in a sliding window on the union of distributed streams that uses only logarithmic memory words.

Estimating Rarity and Similarity over Data Stream Windows

- Computer Science
- ESA
- 2002

In the windowed data stream model, we observe items coming in over time. At any time $t$, we consider the window of the last $N$ observations $a_{t-(N-1)}, a_{t-(N-2)}, \ldots, a_t$, each $a_i \in \{1, \ldots, u\}$;…

Approximate counts and quantiles over sliding windows

- Computer Science
- PODS '04
- 2004

This work considers the problem of maintaining ε-approximate counts and quantiles over a stream sliding window using limited space, and presents various deterministic and randomized algorithms for approximate counts and quantiles that require O(1/ε polylog(1/ε, N)) space.

Maintaining variance and k-medians over data stream windows

- Computer Science
- PODS '03
- 2003

A novel technique is presented for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. A constant-factor approximation algorithm is also presented.

Maintaining significant stream statistics over sliding windows

- Computer Science
- SODA '06
- 2006

It is proved that any data structure for the Significant One Counting problem must use at least $\Omega(\frac{1}{\varepsilon}\log^2\frac{1}{\theta} + \log \varepsilon\theta n)$ bits of memory.

Moment: maintaining closed frequent itemsets over a stream sliding window

- Computer Science
- Fourth IEEE International Conference on Data Mining (ICDM'04)
- 2004

A compact data structure, the closed enumeration tree (CET), is introduced to maintain a dynamically selected set of itemsets over a sliding window; the selected itemsets consist of a boundary between closed frequent itemsets and the rest of the itemsets.

Identifying frequent items in sliding windows over on-line packet streams

- Computer Science
- IMC '03
- 2003

This paper presents a deterministic algorithm for identifying frequent items in sliding windows defined over real-time packet streams that uses limited memory, requires constant processing time per packet, makes only one pass over the data, and is shown to work well when tested on TCP traffic logs.

Maintaining stream statistics over sliding windows: (extended abstract)

- Computer Science
- SODA '02
- 2002

Using the algorithm for the basic counting problem, one can adapt many other techniques to work for the sliding window model, with a multiplicative overhead of $O(\frac{1}{\varepsilon}\log N)$ in memory and a $1 + \varepsilon$ factor loss in accuracy.