• Publications
  • Influence
Running Experiments on Amazon Mechanical Turk
textabstractAlthough Mechanical Turk has recently become popular among social scientists as a source of experimental data, doubts may linger about the quality of data provided by subjects recruited
Duplicate Record Detection: A Survey
This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.
Demographics of Mechanical Turk
We present the results of a survey that collected information about the demographics of participants on Amazon Mechanical Turk, together with information about their level of activity and motivation
Analyzing the Amazon Mechanical Turk marketplace
An associate professor at New York Universitys Stern School of Business uncovers answers about who are the employers in paid crowdsourcing, what tasks they post, and how much they pay.
Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics
This paper is the first study that integrates econometric, text mining, and predictive modeling techniques toward a more complete analysis of the information captured by user-generated online reviews in order to estimate their helpfulness and economic impact.
Quality management on Amazon Mechanical Turk
This work presents algorithms that improve the existing state-of-the-art techniques, enabling the separation of bias and error, and illustrates how to incorporate cost-sensitive classification errors in the overall framework and how to seamlessly integrate unsupervised and supervised techniques for inferring the quality of the workers.
Get another label? improving data quality and data mining using multiple, noisy labelers
The results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
Approximate String Joins in a Database (Almost) for Free
This paper develops a technique for building approximate string join capabilities on top of commercial databases by exploiting facilities already available in them, and demonstrates experimentally the benefits of the technique over the direct use of UDFs.
Deriving the Pricing Power of Product Features by Mining Consumer Reviews
It is argued that product reviews are multifaceted, and hence the textual content of product reviews is an important determinant of consumers' choices, over and above the valence and volume of reviews.
Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection
It is found that learning algorithms are surprisingly robust to annotation errors and this level of training data corruption can lead to an acceptably small increase in test error if the training set has sufficient size.