Learn More
We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with(More)
We propose an approximate integrated approach for solving both problems of finding the most popular <i>k</i> elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-<i>k</i> elements with tight guarantees on errors. For general data distributions, our top-<i>k</i>(More)
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, mul-tisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage(More)
Data cube computation and representation are prohibitively expensive in terms of time and space. Prior work has focused on either reducing the computation time or condensing the representation of a data cube. In this paper, we introduce Range Cubing as an efficient way to compute and compress the data cube without any loss of precision. A new data(More)
Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal(More)
Click fraud is jeopardizing the industry of Internet advertising. Internet advertising is crucial for the thriving of the entire Internet, since it allows producers to advertise their products, and hence contributes to the well being of e-commerce. Moreover, advertising supports the intellectual value of the Internet by covering the running expenses of(More)
Discovering associations between elements occurring in a stream is applicable in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. We develop an algorithm, Streaming-Rules, to report association rules with tight guarantees on errors, using(More)
The primary goal of data stream research is to develop space and time efficient solutions for answering continuous on-line summarization queries. Research efforts over the last decade have resulted in a number of efficient algorithms with varying degrees of space and time complexities. While these techniques are developed in a standard CPU setting, many of(More)