We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. Our technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with ...
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k ...
Data outsourcing or database as a service is a new paradigm for data management in which a third party service provider hosts a database as a service. The service provides data management for its customers and thus obviates the need for the service user to purchase expensive hardware and software, deal with software upgrades and hire professionals for ...
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of 2-stage algorithms, where the first stage ...
Data cube computation and representation are prohibitively expensive in terms of time and space. Prior work has focused on either reducing the computation time or condensing the representation of a data cube. We introduce range cubing as an efficient way to compute and compress the data cube without any loss of precision. A new data structure, range trie, ...
Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal ...
Discovering associations between elements occurring in a stream is useful in numerous applications, including predictive caching and fraud detection. These applications require a new model of association between pairs of elements in streams. We develop an algorithm, Streaming-Rules, to report association rules with tight guarantees on errors, using ...
Click fraud is jeopardizing the Internet advertising industry. Internet advertising is crucial for the thriving of the entire Internet, since it allows producers to advertise their products, and hence contributes to the well-being of e-commerce. Moreover, advertising supports the intellectual value of the Internet by covering the running expenses of ...