Akash Das Sarma

Learn More
Similarity join is the problem of finding pairs of records with similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution , and distributes(More)
In pay-per click sponsored search auctions which are currently extensively used by search engines, the auction for a keyword involves a certain number of advertisers (say k) competing for available slots (say m) to display their ads. This auction is typically conducted for a number of rounds (say T). There are click probabilities µij associated with each(More)
In pay-per-click sponsored search auctions which are currently extensively used by search engines, the auction for a keyword involves a certain number of advertisers (say k) competing for available slots (say m) to display their advertisements (ads for short). A sponsored search auction for a keyword is typically conducted for a number of rounds (say T).(More)
Counting objects is a fundamental image processisng primitive, and has many scientific, health, surveillance, security, and military applications. Existing supervised computer vision techniques typically require large quantities of labeled training data, and even with that, fail to return accurate results in all but the most stylized settings. Using vanilla(More)
We consider the problem of defining, generating, and tracing provenance in data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output results. We first give a new general definition of provenance for general transformations, introducing the notions of correctness, precision, and minimality. We then(More)
We study crowdsourcing quality management, that is, given worker responses to a set of tasks, our goal is to jointly estimate the true answers for the tasks, as well as the quality of the workers. Prior work on this problem relies primarily on applying Expectation-Maximization (EM) on the underlying maximum likelihood problem to estimate true answers as(More)
Crowdsourcing is the primary means to generate training data at scale, and when combined with sophisticated machine learning algorithms, crowdsourcing is an enabler for a variety of emergent automated applications impacting all spheres of our lives. This paper surveys the emerging field of formally reasoning about and optimizing open-ended crowdsourcing, a(More)