Locality-sensitive hashing scheme based on p-stable distributions
A novel locality-sensitive hashing (LSH) scheme for the approximate nearest neighbor (NN) problem under the $l_p$ norm, based on p-stable distributions, is presented; it improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case $p < 1$.
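As a rough illustration of the idea summarized above, a single hash function in such a scheme is commonly written as $h(v) = \lfloor (a \cdot v + b) / w \rfloor$, where the entries of $a$ are drawn from a p-stable distribution (Gaussian for $p = 2$, Cauchy for $p = 1$) and $b$ is uniform in $[0, w]$. The sketch below assumes that form; the function name `make_pstable_hash` and its parameters are hypothetical and chosen only for illustration, not taken from the paper.

```python
import math
import random


def make_pstable_hash(dim, w, p=2, seed=None):
    """Return one hash function h(v) = floor((a . v + b) / w).

    Hypothetical helper for illustration. `w` is the bucket width.
    Only p = 1 (Cauchy) and p = 2 (Gaussian) are handled here.
    """
    rng = random.Random(seed)
    if p == 2:
        # The standard normal distribution is 2-stable.
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    elif p == 1:
        # The standard Cauchy distribution is 1-stable (inverse-CDF sampling).
        a = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(dim)]
    else:
        raise ValueError("this sketch only supports p = 1 or p = 2")
    b = rng.uniform(0.0, w)  # random offset within one bucket

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


h = make_pstable_hash(dim=3, w=4.0, p=2, seed=1)
bucket = h([1.0, 2.0, 3.0])  # an integer bucket index
```

In practice several such functions would be concatenated into one key and several independent tables would be used; that part is omitted here.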
Models and issues in data stream systems
This paper motivates the need for, and the research issues arising from, a new model of data processing in which data does not take the form of persistent relations but instead arrives in multiple, continuous, rapid, time-varying data streams.
Google news personalization: scalable online collaborative filtering
This paper describes the approach to collaborative filtering for generating personalized recommendations for users of Google News using MinHash clustering, Probabilistic Latent Semantic Indexing, and covisitation counts, and combines recommendations from different algorithms using a linear model.
Maintaining Stream Statistics over Sliding Windows
The problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far, is considered. It is shown that, using $O(\frac{1}{\epsilon} \log^2 N)$ bits of memory, the number of 1's can be estimated to within a factor of $1 + \epsilon$.
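The counting result above can be illustrated with a simplified bucket-based counter in the spirit of an exponential histogram. This is a minimal sketch under stated assumptions, not the paper's exact construction: the class name `SlidingOnesCounter` and the parameter `max_per_size` are hypothetical, and the default of two buckets per size only guarantees an estimate within 50% of the true count. Allowing more buckets per size tightens the estimate at the cost of memory.

```python
class SlidingOnesCounter:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size  # hypothetical accuracy knob
        self.buckets = []  # [timestamp of newest 1, size], oldest bucket first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its newest 1 has left the window.
        while self.buckets and self.buckets[0][0] <= self.time - self.window:
            self.buckets.pop(0)
        if not bit:
            return
        self.buckets.append([self.time, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                break
            older, newer = same[0], same[1]
            # Merge the two oldest buckets of this size; keep the newer timestamp.
            self.buckets[newer][1] = 2 * size
            del self.buckets[older]
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only part of the oldest bucket may lie in the window: count half of it.
        return total - self.buckets[0][1] // 2


counter = SlidingOnesCounter(window=8)
for bit in [1, 0, 1, 1, 0, 1]:
    counter.add(bit)
approx = counter.estimate()
```

The number of buckets grows only logarithmically with the window length, because bucket sizes double, which is where the memory saving over storing the whole window comes from.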
STREAM: The Stanford Stream Data Manager
Finding interesting associations without support pruning
  • E. Cohen, Mayur Datar, +5 authors Cheng Yang
  • Computer Science
  • Proceedings of 16th International Conference on…
  • 29 February 2000
This work develops a family of algorithms for association rule mining that combine random sampling and hashing techniques, analyzes the algorithms, and reports experiments on real and synthetic data that compare their performance.
Load shedding for aggregation queries over data streams
Focusing on aggregation queries, this work presents algorithms that determine at which points in a query plan load shedding should be performed, and how much load should be shed at each point, in order to minimize the inaccuracy introduced into query answers.
Sampling from a moving window over streaming data
This work introduces the problem of sampling from a moving window of recent items from a data stream and develops two algorithms. The first, "chain-sample", extends reservoir sampling to deal with the expiration of data elements from the sample; the second, "priority-sample", works even when the number of elements in the window can vary dynamically over time.
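The priority-based idea above can be sketched for a sample of size one as follows: each arriving element receives a random priority, and the sample is the unexpired element with the highest priority. Only elements that no later element outranks need to be kept. This is a minimal sketch assuming a timestamp-based window; the class name `PrioritySampler` and its methods are hypothetical and are not taken from the paper.

```python
import random
from collections import deque


class PrioritySampler:
    """Keep one uniform random sample from the last `window` time units."""

    def __init__(self, window, seed=None):
        self.window = window
        self.rng = random.Random(seed)
        # (timestamp, priority, item); priorities strictly decrease front to back.
        self.candidates = deque()

    def add(self, timestamp, item):
        priority = self.rng.random()
        # Older elements with a lower priority can never become the sample,
        # because this newer element outlives them.
        while self.candidates and self.candidates[-1][1] <= priority:
            self.candidates.pop()
        self.candidates.append((timestamp, priority, item))

    def sample(self, now):
        # Drop candidates whose timestamp has fallen out of the window.
        while self.candidates and self.candidates[0][0] <= now - self.window:
            self.candidates.popleft()
        return self.candidates[0][2] if self.candidates else None


sampler = PrioritySampler(window=10, seed=0)
for t in range(100):
    sampler.add(t, f"item-{t}")
current = sampler.sample(now=99)  # some item with timestamp in (89, 99]
```

Because priorities are independent and identically distributed, every element currently in the window is equally likely to hold the maximum, regardless of how many elements the window contains.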
Query Processing, Approximation, and Resource Management in a Data Stream Management System
This paper describes our ongoing work developing the Stanford Stream Data Manager (STREAM), a system for executing continuous queries over multiple continuous data streams. The STREAM system supports…
STREAM: The Stanford Data Stream Management System
A general-purpose prototype Data Stream Management System (DSMS), also called STREAM, is built that supports a large class of declarative continuous queries over continuous streams and traditional stored data sets.