This paper considers the challenge of evaluating a set of classifiers, as done in shared task evaluations like the KDD Cup or NIST TREC, without expert labels. While expert labels provide the traditional cornerstone for evaluating statistical learners, limited or expensive access to experts represents a practical bottleneck. Instead, we seek a methodology for estimating the performance of the classifiers (relative and absolute) that is more scalable than expert labeling yet preserves high correlation with evaluation based on expert labels. We consider both: 1) using only labels automatically generated by the classifiers themselves (blind evaluation); and 2) using labels obtained via crowdsourcing. While crowdsourcing methods are lauded for scalability, using such data for evaluation raises serious concerns given the prevalence of label noise. For blind evaluation, two broad strategies are investigated: combine & score and score & combine. Combine & score methods infer a single “pseudo-gold” label set by aggregating classifier labels; classifiers are then evaluated against this single pseudo-gold label set. Score & combine methods, on the other hand: i) sample multiple label sets from classifier outputs, ii) evaluate classifiers on each label set, and iii) average classifier performance across label sets. When additional crowd labels are also collected, we investigate two alternative avenues for exploiting them: 1) direct evaluation of classifiers; or 2) supervision of combine & score methods. To assess the generality of our techniques, classifier performance is measured using four common classification metrics, with statistical significance tests establishing the relative performance of the classifiers for each metric. Finally, we measure both score and rank correlations between estimated classifier performance and actual performance according to expert judgments.
Rigorous evaluation of classifiers from the TREC 2011 Crowdsourcing Track shows that reliable evaluation can be achieved without reliance on expert labels.

Hyun Joon Jung, School of Information, University of Texas at Austin. E-mail: hyunJoon@utexas.edu
Matthew Lease, School of Information, University of Texas at Austin. E-mail: email@example.com

arXiv:1212.0960v1 [cs.LG] 5 Dec 2012

Fig. 1 Our experimental framework. As input, K binary classifiers each label M examples. (a) As ground truth, classifiers are scored on several metrics based on expert judgments, the statistical significance of differences is computed, and classifiers are ranked (best to worst). (b) An estimation method p is used to predict classifier scores without expert judgments, and classifiers are ranked according to their estimated scores (score differences that are not statistically significant yield tied rankings). Score and rank correlations are then measured between estimated and actual scores and ranks (c1). (b’) A second, alternative method q is used to estimate classifier scores, classifiers are ranked accordingly, and the correlation of scores and ranks with ground truth is measured (c2). Finally, we compare the correlations c1 and c2 to determine whether p or q achieved the greater score and rank correlation (with statistical significance).
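
To make the framework of Fig. 1 concrete, the following is a minimal Python sketch of the two blind-evaluation strategies and of the correlation check against expert labels. It is an illustration under stated assumptions, not the implementation evaluated in this paper: majority voting stands in for the combine step, accuracy stands in for the four metrics, label sets are sampled by copying each example's label from a uniformly chosen classifier, Kendall's tau-a stands in for the rank correlation, and the significance tests that produce tied rankings are omitted. All function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def accuracy(pred, ref):
    """Fraction of examples on which pred agrees with ref."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


def majority_vote(outputs):
    """Combine step (assumed): one pseudo-gold label per example, ties go to 1."""
    k = len(outputs)
    return [int(2 * sum(col) >= k) for col in zip(*outputs)]


def combine_and_score(outputs):
    """Score every classifier against a single pseudo-gold label set."""
    pseudo_gold = majority_vote(outputs)
    return [accuracy(o, pseudo_gold) for o in outputs]


def score_and_combine(outputs, n_samples=200, seed=0):
    """Sample label sets from classifier outputs, score on each, then average."""
    rng = random.Random(seed)
    k, m = len(outputs), len(outputs[0])
    totals = [0.0] * k
    for _ in range(n_samples):
        # Assumed sampling scheme: copy each label from a random classifier.
        sampled = [outputs[rng.randrange(k)][j] for j in range(m)]
        for i, o in enumerate(outputs):
            totals[i] += accuracy(o, sampled)
    return [t / n_samples for t in totals]


def kendall_tau(x, y):
    """Kendall tau-a rank correlation; tied pairs contribute zero."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        a = (x[i] > x[j]) - (x[i] < x[j])
        b = (y[i] > y[j]) - (y[i] < y[j])
        s += a * b
    return s / len(pairs)


# Toy data (hypothetical): K = 3 binary classifiers label M = 8 examples.
expert = [1, 0, 1, 1, 0, 0, 1, 0]
outputs = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 0, 1, 0],
]

actual = [accuracy(o, expert) for o in outputs]   # (a) ground truth scores
estimated = combine_and_score(outputs)            # (b) blind estimate
print(actual)                           # [1.0, 0.75, 0.5]
print(estimated)                        # [0.875, 0.875, 0.625]
print(kendall_tau(estimated, actual))   # 2/3: one estimated tie, two concordant pairs
print(score_and_combine(outputs))       # (b') alternative blind estimate
```

On this toy input the pseudo-gold labels cannot separate the two strongest classifiers, which lowers the rank correlation with the expert-based ranking; this is the kind of gap that the score and rank correlations in Fig. 1 are meant to quantify.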