Akash Das Sarma

Similarity join is the problem of finding pairs of records with a similarity score greater than some threshold. In this paper we study the problem of scaling up similarity join for different metric distance functions using MapReduce. We propose a ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes ...
Counting objects is a fundamental image processing primitive, and has many scientific, health, surveillance, security, and military applications. Existing supervised computer vision techniques typically require large quantities of labeled training data, and even with that, fail to return accurate results in all but the most stylized settings. Using vanilla ...
In pay-per-click sponsored search auctions, which are currently used extensively by search engines, the auction for a keyword involves a certain number of advertisers (say k) competing for available slots (say m) to display their ads. This auction is typically conducted for a number of rounds (say T). There are click probabilities µ_ij associated with each ...
We study crowdsourcing quality management: given worker responses to a set of tasks, our goal is to jointly estimate the true answers for the tasks, as well as the quality of the workers. Prior work on this problem relies primarily on applying Expectation-Maximization (EM) on the underlying maximum likelihood problem to estimate true answers as ...
We consider the problem of defining, generating, and tracing provenance in data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output results. We first give a new general definition of provenance for general transformations, introducing the notions of correctness, precision, and minimality. We then ...
In pay-per-click sponsored search auctions, which are currently used extensively by search engines, the auction for a keyword involves a certain number of advertisers (say k) competing for available slots (say m) to display their advertisements (ads for short). A sponsored search auction for a keyword is typically conducted for a number of rounds (say T). ...
We conduct an experimental analysis of a dataset comprising over 27 million microtasks performed by over 70,000 workers, issued to a large crowdsourcing marketplace between 2012 and 2016. Using this data, never before analyzed in an academic context, we shed light on three crucial aspects of crowdsourcing: (1) Task design: helping requesters understand what ...